SlideShare a Scribd company logo
CI/CD with an
Idempotent Kafka
Producer & Consumer
Eden Ohana
SW Engineer, Treeverse
eden.ohana@treeverse.io
April 2022
Table of Contents
Data lakes Continuous data
streaming
Git for data Kafka with lakeFS
Demo
Data Lakes Today - Video
Deployment
DATA
Data Source
Data Source
Data Consumer
Data Consumer
ETL ETL
Data Lakes Today
Streaming Data with Kafka Connect
Kafka In Production
Challenges with Continuous Data Streaming
● In-ability to experiment, compare and reproduce
● Troubleshoot producer issues
● Data quality tests
● Cross collection consistency, synchronize kafka topics
Idempotency and data lakes problems
● The idempotent producer solves the problem of duplicate messages and
provides the 'exactly-once-delivery'
● Exactly once delivery requires tight operation of multiple parts in the data
pipeline
● Data quality is at risk during CI/CD of one component of the data pipeline
How is it solved today?
● Duplicate data, maintain sandbox
● Real time data quality tests
● Object level versioning
● Readers  Writers separation
Production
Deployment Development
DATA
How do you
test new /
modified
pipelines?
What
happens
when
production
data gets
corrupted?
How do you
guarantee
data quality?
Data Lakes Today
we would manage data
from dev to production
the way we manage code
In a perfect world
s3://data-repo/collections/foo
s3://data-repo/main/collections/foo
lakectl branch create 
lakefs://repo@testing-spark-3 
--source lakefs://repo@main
# output:
# created branch 'testing-spark-3.0',
# pointing to commit ID: 'd1e9adc71c10a’
Object Store
Manage Data like Code With lakeFS
Why Git for Data
Challenge Solution
$ lakectl revert main^1
$ lakectl branch create my-branch
$ spark.read.parquet
(‘s3://my-repo/<commit_id>’)
$ lakectl merge my-branch main
$ lakectl branch create new-logic
$ lakectl merge new-logic main
Quickly recover from an error
Develop in Isolation
Troubleshoot
Atomic Update
Reprocess
Production
Deployment Development
DATA
Safely test
your pipelines
on an
production
identical
isolated
branch
Easily revert
corrupted data.
Time travel to
debug a snapshot
Production
identical
isolated
branch and
hooks to test
your data
Data Lake With lakeFS
How it Works
lakeFS Architecture
Data Model
Commit
Meta range ID: ...
Master
Commit ID: ...
8-12MB .sst files,
containing continuous
sorted keys
.sst file,
pointing to
range files
Further Reading
How does lakeFS Work?
Kafka with lakeFS
What Does A Typical Environment Look Like?
In Dev
In Development
● Experimentation - try tools and code in isolation
● Debug - create an isolated snapshot of the data at the time of the failure
● Collaborate - tools, code or different versions of your data
In Deployment
In Deployment
● Version control - manage data interfaces with clear versions and easily point
the consumers to newly deployed data.
● Test - define pre-merge and pre-commit hooks to run tests that enforce best
practices.
prevent breaking changes from entering your production data environment
In Production
In Production
● Roll Back - recover from errors by instantly reverting data to a former,
consistent snapshot of the data lake.
● Troubleshoot - investigate production errors by starting with a snapshot of the
inputs to the failed process
● Cross-collection Consistency - provide consumers multiple synchronized
collections of data in one atomic, revertible action
Integrating Kafka with LakeFS
# Your lakeFS repository
s3.bucket.name=example-repo
# Your lakeFS S3 endpoint and credentials
store.url=https://s3.lakefs.example.com
aws.access.key.id=AKIAIOSFODNN7EXAMPLE
aws.secret.access.key=wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY
# stream-1-branch being the branch we want to write to
topics.dir=stream-1-branch/topics
Lakectl branch revert lakefs://example-repo@stream-1-branch --commit dd8a60d5ef70809
Demo
Thanks!
Join the community eden.ohana@treeverse.io
GitHub
lakeFS.io
Check out the docs.lakefs.io
Contribute and star on Github:
Treeverse/lakeFS

More Related Content

What's hot

ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
confluent
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
Arvind Kumar G.S
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
AIMDek Technologies
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
Jiangjie Qin
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
Saroj Panyasrivanit
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
HostedbyConfluent
 
Tracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracing
Tracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracingTracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracing
Tracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracing
Yuri Shkuro
 
stackconf 2022: Open Source for Better Observability
stackconf 2022: Open Source for Better Observabilitystackconf 2022: Open Source for Better Observability
stackconf 2022: Open Source for Better Observability
NETWAYS
 
Clean Infrastructure as Code
Clean Infrastructure as Code Clean Infrastructure as Code
Clean Infrastructure as Code
QAware GmbH
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Flink Forward
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
Chhavi Parasher
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For Operators
Kevin Brockhoff
 
Service Mesh - Observability
Service Mesh - ObservabilityService Mesh - Observability
Service Mesh - Observability
Araf Karsh Hamid
 
Terraform Basics
Terraform BasicsTerraform Basics
Terraform Basics
Mohammed Fazuluddin
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on Kubernetes
Red Hat Developers
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
Michal Cwienczek
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack Architecture
Mirantis
 
Docker compose
Docker composeDocker compose
Docker compose
Felipe Ruhland
 
CICD Pipelines for Microservices Best Practices
CICD Pipelines for Microservices Best Practices CICD Pipelines for Microservices Best Practices
CICD Pipelines for Microservices Best Practices
Codefresh
 

What's hot (20)

ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 
Tracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracing
Tracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracingTracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracing
Tracing 2000+ polyglot microservices at Uber with Jaeger and OpenTracing
 
stackconf 2022: Open Source for Better Observability
stackconf 2022: Open Source for Better Observabilitystackconf 2022: Open Source for Better Observability
stackconf 2022: Open Source for Better Observability
 
Clean Infrastructure as Code
Clean Infrastructure as Code Clean Infrastructure as Code
Clean Infrastructure as Code
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For Operators
 
Service Mesh - Observability
Service Mesh - ObservabilityService Mesh - Observability
Service Mesh - Observability
 
Terraform Basics
Terraform BasicsTerraform Basics
Terraform Basics
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on Kubernetes
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack Architecture
 
Docker compose
Docker composeDocker compose
Docker compose
 
CICD Pipelines for Microservices Best Practices
CICD Pipelines for Microservices Best Practices CICD Pipelines for Microservices Best Practices
CICD Pipelines for Microservices Best Practices
 

Similar to CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022

Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Gs08 modernize your data platform with sql technologies wash dc
Gs08 modernize your data platform with sql technologies   wash dcGs08 modernize your data platform with sql technologies   wash dc
Gs08 modernize your data platform with sql technologies wash dc
Bob Ward
 
Database CI/CD Pipeline
Database CI/CD PipelineDatabase CI/CD Pipeline
Database CI/CD Pipeline
muhammadhashir57
 
Ankit Chohan - Java
Ankit Chohan - JavaAnkit Chohan - Java
Ankit Chohan - Java
Ankit Chohan
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Kevin Slade - CV
Kevin Slade - CVKevin Slade - CV
Kevin Slade - CV
Kevin Slade
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
confluent
 
Perfsystems- Consulting Services
Perfsystems- Consulting ServicesPerfsystems- Consulting Services
Perfsystems- Consulting Services
Perfsys Tems
 
BGOUG "Agile Data: revolutionizing database cloning'
BGOUG  "Agile Data: revolutionizing database cloning'BGOUG  "Agile Data: revolutionizing database cloning'
BGOUG "Agile Data: revolutionizing database cloning'
Kyle Hailey
 
Virtualization and Containers
Virtualization and ContainersVirtualization and Containers
Virtualization and Containers
Kellyn Pot'Vin-Gorman
 
SQL Saturday San Diego
SQL Saturday San DiegoSQL Saturday San Diego
SQL Saturday San Diego
Kellyn Pot'Vin-Gorman
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
VMware Tanzu
 
Architecture of a Next-Generation Parallel File System
Architecture of a Next-Generation Parallel File System	Architecture of a Next-Generation Parallel File System
Architecture of a Next-Generation Parallel File System
Great Wide Open
 
Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"
Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"
Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"
Fwdays
 
Database CI Demo Using Sql Server
Database CI  Demo Using Sql ServerDatabase CI  Demo Using Sql Server
Database CI Demo Using Sql Server
Umesh Kumar
 
Achieving Full Stack DevOps at Colonial Life
Achieving Full Stack DevOps at Colonial Life Achieving Full Stack DevOps at Colonial Life
Achieving Full Stack DevOps at Colonial Life
DevOps.com
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
Cisco DevNet
 
Syntergy upgrade open text content server with replicator - 7-3-2016
Syntergy   upgrade open text content server with replicator - 7-3-2016Syntergy   upgrade open text content server with replicator - 7-3-2016
Syntergy upgrade open text content server with replicator - 7-3-2016
Vijay Sharma
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
Shu-Jeng Hsieh
 

Similar to CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022 (20)

Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Gs08 modernize your data platform with sql technologies wash dc
Gs08 modernize your data platform with sql technologies   wash dcGs08 modernize your data platform with sql technologies   wash dc
Gs08 modernize your data platform with sql technologies wash dc
 
Database CI/CD Pipeline
Database CI/CD PipelineDatabase CI/CD Pipeline
Database CI/CD Pipeline
 
Ankit Chohan - Java
Ankit Chohan - JavaAnkit Chohan - Java
Ankit Chohan - Java
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Kevin Slade - CV
Kevin Slade - CVKevin Slade - CV
Kevin Slade - CV
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Perfsystems- Consulting Services
Perfsystems- Consulting ServicesPerfsystems- Consulting Services
Perfsystems- Consulting Services
 
BGOUG "Agile Data: revolutionizing database cloning'
BGOUG  "Agile Data: revolutionizing database cloning'BGOUG  "Agile Data: revolutionizing database cloning'
BGOUG "Agile Data: revolutionizing database cloning'
 
Virtualization and Containers
Virtualization and ContainersVirtualization and Containers
Virtualization and Containers
 
SQL Saturday San Diego
SQL Saturday San DiegoSQL Saturday San Diego
SQL Saturday San Diego
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
 
Architecture of a Next-Generation Parallel File System
Architecture of a Next-Generation Parallel File System	Architecture of a Next-Generation Parallel File System
Architecture of a Next-Generation Parallel File System
 
Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"
Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"
Serhiy Kalinets "Embracing architectural challenges in the modern .NET world"
 
Database CI Demo Using Sql Server
Database CI  Demo Using Sql ServerDatabase CI  Demo Using Sql Server
Database CI Demo Using Sql Server
 
Achieving Full Stack DevOps at Colonial Life
Achieving Full Stack DevOps at Colonial Life Achieving Full Stack DevOps at Colonial Life
Achieving Full Stack DevOps at Colonial Life
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Syntergy upgrade open text content server with replicator - 7-3-2016
Syntergy   upgrade open text content server with replicator - 7-3-2016Syntergy   upgrade open text content server with replicator - 7-3-2016
Syntergy upgrade open text content server with replicator - 7-3-2016
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 

More from HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Recently uploaded

Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Jeffrey Haguewood
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 

Recently uploaded (20)

Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 

CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022

  • 1. CI/CD with an Idempotent Kafka Producer & Consumer Eden Ohana SW Engineer, Treeverse eden.ohana@treeverse.io April 2022
  • 2. Table of Contents Data lakes Continuous data streaming Git for data Kafka with lakeFS Demo
  • 4. Deployment DATA Data Source Data Source Data Consumer Data Consumer ETL ETL Data Lakes Today
  • 5. Streaming Data with Kafka Connect
  • 7. Challenges with Continuous Data Streaming ● In-ability to experiment, compare and reproduce ● Troubleshoot producer issues ● Data quality tests ● Cross collection consistency, synchronize kafka topics
  • 8. Idempotency and data lakes problems ● The idempotent producer solves the problem of duplicate messages and provides the 'exactly-once-delivery' ● Exactly once delivery requires tight operation of multiple parts in the data pipeline ● Data quality is at risk during CI/CD of one component of the data pipeline
  • 9. How is it solved today? ● Duplicate data, maintain sandbox ● Real time data quality tests ● Object level versioning ● Readers Writers separation
  • 10. Production Deployment Development DATA How do you test new / modified pipelines? What happens when production data gets corrupted? How do you guarantee data quality? Data Lakes Today
  • 11. we would manage data from dev to production the way we manage code In a perfect world
  • 12. s3://data-repo/collections/foo s3://data-repo/main/collections/foo lakectl branch create lakefs://repo@testing-spark-3 --source lakefs://repo@main # output: # created branch 'testing-spark-3.0', # pointing to commit ID: 'd1e9adc71c10a’ Object Store Manage Data like Code With lakeFS
  • 13. Why Git for Data Challenge Solution $ lakectl revert main^1 $ lakectl branch create my-branch $ spark.read.parquet (‘s3://my-repo/<commit_id>’) $ lakectl merge my-branch main $ lakectl branch create new-logic $ lakectl merge new-logic main Quickly recover from an error Develop in Isolation Troubleshoot Atomic Update Reprocess
  • 14. Production Deployment Development DATA Safely test your pipelines on an production identical isolated branch Easily revert corrupted data. Time travel to debug a snapshot Production identical isolated branch and hooks to test your data Data Lake With lakeFS
  • 17. Data Model Commit Meta range ID: ... Master Commit ID: ... 8-12MB .sst files, containing continuous sorted keys .sst file, pointing to range files Further Reading
  • 20. What Does A Typical Environment Look Like?
  • 21. In Dev In Development ● Experimentation - try tools and code in isolation ● Debug - create an isolated snapshot of the data at the time of the failure ● Collaborate - tools, code or different versions of your data
  • 22. In Deployment In Deployment ● Version control - manage data interfaces with clear versions and easily point the consumers to newly deployed data. ● Test - define pre-merge and pre-commit hooks to run tests that enforce best practices. prevent breaking changes from entering your production data environment
  • 23. In Production In Production ● Roll Back - recover from errors by instantly reverting data to a former, consistent snapshot of the data lake. ● Troubleshoot - investigate production errors by starting with a snapshot of the inputs to the failed process ● Cross-collection Consistency - provide consumers multiple synchronized collections of data in one atomic, revertible action
  • 24. Integrating Kafka with LakeFS # Your lakeFS repository s3.bucket.name=example-repo # Your lakeFS S3 endpoint and credentials store.url=https://s3.lakefs.example.com aws.access.key.id=AKIAIOSFODNN7EXAMPLE aws.secret.access.key=wJalrXUtnFEMIK7MDENGbPxRfiCYEXAMPLEKEY # stream-1-branch being the branch we want to write to topics.dir=stream-1-branch/topics Lakectl branch revert lakefs://example-repo@stream-1-branch --commit dd8a60d5ef70809
  • 25. Demo
  • 26. Thanks! Join the community eden.ohana@treeverse.io GitHub lakeFS.io Check out the docs.lakefs.io Contribute and star on Github: Treeverse/lakeFS