Semantic Validation For Kafka® Data Quality
Diwei Jiang, Xinli Shang
Uber | Kafka London Summit 2024
Speaker Introduction
● Diwei Jiang
○ Senior Software Engineer @ Uber Streaming Data
● Xinli Shang
○ Senior Engineering Manager @ Uber Streaming Data
○ Apache® Parquet PMC chair, Presto® committer
Agenda
● Uber Kafka & Data Lake architecture
● Motivation
● Semantic Validation
● Use cases in both Streaming and Data Lake
● Future work
Uber Streaming & Data Lake Architecture
[Architecture diagram. Labels: Ingestion; Online Storage; Events; Telemetry; Feeds; Kafka; Data Lake; Compute Fabric; Real-Time Analytics; Batch Analytics; Stream Processing; Complex Processing; Interactive ETL; In-memory storage (Pinot); Data Platform & Tools: Data Workflow (Piper, uWorc), BI Tools (QueryBuilder, Dashbuilder), Metadata Platform (Databook, Quality, Lineage); Security; Global Data Warehouse; 1000 services]
Uber Data Flows
Corrupted Data is a Poison Pill
● Catastrophic impact on the business
● Difficult to detect in a timely manner
● Recovery process is costly
Semantic Validation
What’s Semantic Validation?
Verifies the content of the data being transmitted through Kafka topics.
Example types of constraints:
● Number constraint:
○ e.g., payment amount, age
● String constraint:
○ e.g., product name length, address format
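The two constraint types above could be modeled roughly as follows. This is a minimal sketch, not the implementation from the talk: the class names `NumberConstraint` and `StringConstraint`, their fields, and the example limits are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NumberConstraint:
    """Hypothetical numeric constraint: an inclusive [minimum, maximum] range."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def is_valid(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


@dataclass(frozen=True)
class StringConstraint:
    """Hypothetical string constraint: a maximum length and/or a regex
    that the whole value must match."""
    max_length: Optional[int] = None
    pattern: Optional[str] = None

    def is_valid(self, value: str) -> bool:
        if self.max_length is not None and len(value) > self.max_length:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        return True


# Illustrative limits only; real limits would come from the schema owner.
age = NumberConstraint(minimum=0, maximum=150)
product_name = StringConstraint(max_length=64)

print(age.is_valid(34), age.is_valid(-1))
print(product_name.is_valid("Espresso machine"))
```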
Design Goals
● Platform integration & reusability
○ Consistent with the existing schema evolution flow.
○ Centralize validation flows.
● User customizations
○ Provide users with the flexibility to customize validation behavior and configure alerting.
● Timely detection
○ Validate on the producer side before data enters Kafka.
Current Enforcement Limitations
● Limitations of current checks:
○ Relying on checks in application code to verify data integrity can be insufficient.
○ Validations in code are often implemented downstream, as reactive fixes after an outage.
● Absence of built-in support in Avro:
○ Avro lacks native mechanisms for expressing semantic constraints within schemas.
○ Custom validation outside Avro leads to inconsistency and complexity in data pipelines.
Architecture
- Teams can easily access their schemas and update constraints.
- Application services depend on the producer client to fetch the schema and validate.
- The validator emits metrics for failed data, and the monitoring system sends out alerts.
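The flow in this slide (fetch constraints with the schema, validate, emit a metric per failure) might look roughly like the sketch below. Everything here is a stand-in: `SCHEMA_CONSTRAINTS` plays the role of the schema service, `failure_metrics` plays the role of a metrics client, and the `payments` topic and its fields are invented for illustration.

```python
from collections import Counter
from typing import Any, Callable, Dict

# Hypothetical stand-in for the schema service:
# topic -> field name -> check function.
SCHEMA_CONSTRAINTS: Dict[str, Dict[str, Callable[[Any], bool]]] = {
    "payments": {
        "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
        "currency": lambda v: isinstance(v, str) and len(v) == 3,
    }
}

# Hypothetical stand-in for a metrics client: failures per (topic, field).
failure_metrics: Counter = Counter()


def validate_record(topic: str, record: Dict[str, Any]) -> bool:
    """Check every constrained field present in the record.

    Each violation increments a counter that a monitoring system could
    alert on. Returns True when no violation was found.
    """
    constraints = SCHEMA_CONSTRAINTS.get(topic, {})
    ok = True
    for field, check in constraints.items():
        if field in record and not check(record[field]):
            failure_metrics[(topic, field)] += 1
            ok = False
    return ok
```

A producer wrapper would call `validate_record("payments", record)` before sending; alerting would then be driven by the counters rather than by the producer itself.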
UI & Schema Evolution
● Users create constraints on fields
● The frontend validates the constraint format
● Constraint change -> version change
Constraint Examples
● Numeric type
● String type
Reusing Constraints
● Predefined constraints
○ The address regex is predefined in the schema backend.
● Object-level constraints
○ Future plan: adding custom constraints for a shared object (e.g., BillingEntry) allows centralized validation of the same object across schemas. The object-level validation design is a work in progress.
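One way to picture a predefined constraint is as a named entry that several schemas reference instead of each carrying its own copy. The sketch below assumes a simple name-to-regex registry; the registry, the schema dictionaries, and the ZIP code pattern are all hypothetical and are not the address regex mentioned on the slide.

```python
import re
from typing import Dict

# Hypothetical registry of predefined constraints, keyed by name.
# The pattern is illustrative only.
PREDEFINED_PATTERNS: Dict[str, str] = {
    "us_zip_code": r"\d{5}(-\d{4})?",
}

# Two hypothetical schemas reference the same predefined constraint by name.
RIDER_SCHEMA: Dict[str, str] = {"home_zip": "us_zip_code"}
MERCHANT_SCHEMA: Dict[str, str] = {"store_zip": "us_zip_code"}


def field_is_valid(schema: Dict[str, str], field: str, value: str) -> bool:
    """Look up the named constraint for a field and match the whole value."""
    pattern = PREDEFINED_PATTERNS[schema[field]]
    return re.fullmatch(pattern, value) is not None
```

Because both schemas point at the same name, changing the pattern in one place changes the check for every field that uses it.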
Encoding and Validation
● Validating during encoding
● Different rules for each data type
● Sampling mechanism
● Per-record encoding P99 latency is ~130 μs with validation and ~100 μs without
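A sampling mechanism of this kind could be sketched as below: only a fraction of records pay the validation cost on the encoding path. The function name, the `sample_rate` parameter, and the use of JSON in place of Avro encoding are assumptions for illustration, not details from the talk.

```python
import json
import random
from typing import Any, Callable, Dict, Tuple


def encode_with_sampled_validation(
    record: Dict[str, Any],
    validate: Callable[[Dict[str, Any]], bool],
    sample_rate: float,
    rng: random.Random,
) -> Tuple[bytes, bool]:
    """Encode a record, validating it only when it falls into the sample.

    Returns (payload, violation_found). JSON stands in for Avro here.
    """
    violation = False
    # rng.random() is in [0.0, 1.0), so a rate of 1.0 validates every
    # record and a rate of 0.0 validates none.
    if rng.random() < sample_rate:
        violation = not validate(record)
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return payload, violation


rng = random.Random(42)  # seeded so a run is repeatable
payload, violation = encode_with_sampled_validation(
    {"amount": -1}, lambda r: r["amount"] >= 0, sample_rate=1.0, rng=rng
)
print(payload, violation)
```

The trade-off is the one the deck names later: records outside the sample are never checked.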
Open Questions #1
Should we drop the bad data directly?
Here are the trade-offs of each option:
○ Drop invalid data: prevents bad data from entering but causes data loss
○ Alert only: non-disruptive, but won't prevent polluted data from flowing in
○ Set up a DLQ for the producer: increased maintenance cost
○ Insert a new header: delegates identifying polluted data to consumers
Decision: discarding data directly is an opt-in configuration; otherwise we only create alerts in the first phase.
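The decision above (alert by default, drop only when the user opts in) can be expressed as a small policy switch. This is a sketch under assumptions: the `OnInvalid` enum, `handle_record`, and the `send`/`alert` callbacks are hypothetical names, not the producer client's API.

```python
from enum import Enum
from typing import Any, Callable, Dict

Record = Dict[str, Any]


class OnInvalid(Enum):
    ALERT_ONLY = "alert_only"  # default here: the record is still produced
    DROP = "drop"              # opt-in: the record is discarded


def handle_record(
    record: Record,
    is_valid: Callable[[Record], bool],
    send: Callable[[Record], None],
    alert: Callable[[Record], None],
    policy: OnInvalid = OnInvalid.ALERT_ONLY,
) -> bool:
    """Apply the configured policy; returns True if the record was sent."""
    if is_valid(record):
        send(record)
        return True
    alert(record)  # both policies raise an alert for invalid data
    if policy is OnInvalid.DROP:
        return False
    send(record)
    return True
```

With `ALERT_ONLY`, an invalid record reaches the topic and an alert fires; with `DROP`, the alert fires and the record is lost, which is the data-loss trade-off listed above.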
Open Questions #2
Backward compatibility for constraint updates:
Day 0: the user sets the constraint to a range (0-100)
Day 1: the user updates the constraint to (0-90)
Now data with a value of 95 is no longer considered valid. Do we allow this change when the user updates the schema?
- If a topic has multiple producers, one of them with the latest schema may start to trigger more violation errors, causing inconsistency.
- We decided to allow this for the first phase but warn users when they update the schema.
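The warning on update could be driven by a check that the new range rejects values the old range accepted. A minimal sketch, assuming inclusive, well-formed `(low, high)` ranges; the function names and the warning text are hypothetical.

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # inclusive (low, high), with low <= high


def is_narrowing(old: Range, new: Range) -> bool:
    """True if some value accepted by `old` is rejected by `new`."""
    old_low, old_high = old
    new_low, new_high = new
    return new_low > old_low or new_high < old_high


def update_warning(old: Range, new: Range) -> Optional[str]:
    """Allow the change either way, but return a warning when it narrows."""
    if is_narrowing(old, new):
        return (
            f"Constraint narrowed from {old} to {new}: "
            "previously valid values may now be reported as violations."
        )
    return None


# The Day 0 / Day 1 example: (0-100) updated to (0-90).
print(update_warning((0, 100), (0, 90)))
```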
Semantic Validation for both Online and Offline
- Offline paths can extend the validator logic on consume
- This allows each consumer pipeline the flexibility to configure different behavior
Limitations
Sampling cannot guarantee thorough validation.
● Backpressure based on real-time capacity, to maximize the sample while keeping latency low
● Progressive validation when error pattern trends emerge
● Auditing service to consume the topic and perform comprehensive validation
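The backpressure idea in the first bullet could take the form of a feedback rule that raises the sample rate while latency is within budget and lowers it otherwise. The deck does not describe the actual mechanism, so the heuristic below, including `next_sample_rate`, the multiplicative `step`, and the `floor`, is purely an assumed illustration.

```python
def next_sample_rate(
    current_rate: float,
    observed_p99_us: float,
    budget_p99_us: float,
    step: float = 0.1,
    floor: float = 0.01,
) -> float:
    """Hypothetical heuristic: shrink the rate when over the latency budget,
    grow it otherwise, and clamp the result to [floor, 1.0]."""
    if observed_p99_us > budget_p99_us:
        proposed = current_rate * (1.0 - step)
    else:
        proposed = current_rate * (1.0 + step)
    return max(floor, min(1.0, proposed))


# Over budget: the rate goes down. Under budget: it goes back up.
print(next_sample_rate(0.5, observed_p99_us=150, budget_p99_us=130))
print(next_sample_rate(0.5, observed_p99_us=110, budget_p99_us=130))
```

The floor keeps some validation running even under sustained load, so violations are still sampled rather than silently skipped.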
Future Work
● Productionize it
● Upstream to OSS
● Dynamic sampling
● Comprehensive auditing
● Reusable constraints, cross field constraints
Q & A
Send questions to: shangxinli@apache.org