SlideShare a Scribd company logo
1 of 24
Download to read offline
CI/CD for Data –
Building Dev/Test Data
Environments With lakeFS
Vino SD, Developer Advocate
About Me
■ I am Vino Duraisamy.
■ SWE -> Data/ML Engineer -> Developer Advocate
■ Open-source contributor @lakeFS
@vinodhini_sd @vinodhini-sd
vinodhini-sd.medium.com
A Day in the Life of a Data Engineer!
Expectation Reality
What Can Go Wrong?
■ EMR/Spark upgrade
■ Schema changes
■ Change in business logic
■ Infrastructure changes
■ Troubleshooting failed spark jobs
■ Backfilling historic data
■ Non-idempotent pipelines
■ Maintaining legacy DAGs/pipelines
■ Unit Testing
■ Integration Testing
■ E2E Testing
■ Realistic Input Data
■ Mock data
■ Sampled production data
■ Copy all of Production data
Effective Testing Strategy
CI/CD for Data–
Apply engineering best
practices to data
engineering
Build
■ Place all data
assets under data
version control
Test
■ Create isolated
data environments
on-demand
■ Automate all your
tests
Deploy
■ Data quality checks
passed
■ Automatic rollbacks
are in place
CI/CD Best Practices
Build
■ Place all data
assets under data
version control
Test (CI)
■ Create isolated
data environments
on-demand
■ Automate all your
tests
Deploy
■ Data quality checks
passed
■ Automatic rollbacks
are in place
CI/CD Best Practices
Build
■ Place all data
assets under data
version control
Test (CI)
■ Create isolated
data environments
on-demand
■ Automate all your
tests
Deploy (CD)
■ Data quality checks
passed
■ Automatic rollbacks
are in place
CI/CD Best Practices
Build – Version
Control
Data Repo
Branches of the
whole repository
Data Versioning at
the object level
Place All Data Assets Under Version
Control
Git for Data with lakeFS
s3://data-repo/collections/foo
lakefs://data-repo/main/collections/foo
lakectl branch create 
lakefs://repo/experiment-1 
--source lakefs://repo/main
# output:
# created branch 'experiment-1',
# pointing to commit ID: 'd1e9adc71c10a’
How lakeFS Works
Test - CI
Ingestion
Experimentation
Safe CI: Create Isolated Data Environments
On-Demand
Automated Tests Using lakeFS Hooks
Demo Time
Deploy - CD
■ Conceptually similar to Github actions
■ Hooks run as a standalone webserver - performs tests when triggered
■ Run custom tests or use test suites like Great Expectations, etc.,
Automating Data Quality Checks Using
lakeFS Hooks
■ post_merge event: Failed
Hook runs can
programmatically revert
the changes to a branch.
■ Reliability of data in Prod.
$lakectl revert main^1
CD: Automatic Rollbacks in Place
Easily revert
corrupted
data.
Time travel
to debug a
snapshot
Safely test your
pipelines on an
isolated branch
from production
Production
identical
isolated
branch and
hooks to test
your data
Data Lakes with lakeFS
.io
Thank You

More Related Content

Similar to CI/CD for Data – Building Dev/Test Data Environments With lakeFS

Rock Solid Deployment of Web Applications
Rock Solid Deployment of Web ApplicationsRock Solid Deployment of Web Applications
Rock Solid Deployment of Web ApplicationsPablo Godel
 
Getting to Walk with DevOps
Getting to Walk with DevOpsGetting to Walk with DevOps
Getting to Walk with DevOpsEklove Mohan
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Piyush Kumar
 
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsBig Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsDurga Gadiraju
 
Build automation best practices
Build automation best practicesBuild automation best practices
Build automation best practicesCode Mastery
 
The Hard Problems of Continuous Deployment
The Hard Problems of Continuous DeploymentThe Hard Problems of Continuous Deployment
The Hard Problems of Continuous DeploymentTimothy Fitz
 
Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e...
 Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e... Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e...
Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e...VMware Tanzu
 
Thick Application Penetration Testing - A Crash Course
Thick Application Penetration Testing - A Crash CourseThick Application Penetration Testing - A Crash Course
Thick Application Penetration Testing - A Crash CourseNetSPI
 
A Byte of Software Deployment
A Byte of Software DeploymentA Byte of Software Deployment
A Byte of Software DeploymentGong Haibing
 
CI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure DatabricksCI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure DatabricksGoDataDriven
 
Continuous Load Testing with CloudTest and Jenkins
Continuous Load Testing with CloudTest and JenkinsContinuous Load Testing with CloudTest and Jenkins
Continuous Load Testing with CloudTest and JenkinsSOASTA
 
Саша Белецкий "Continuous Delivery в продуктовой разработке"
Саша Белецкий "Continuous Delivery в продуктовой разработке"Саша Белецкий "Continuous Delivery в продуктовой разработке"
Саша Белецкий "Continuous Delivery в продуктовой разработке"Agile Base Camp
 
DevOps for Big Data - Data 360 2014 Conference
DevOps for Big Data - Data 360 2014 ConferenceDevOps for Big Data - Data 360 2014 Conference
DevOps for Big Data - Data 360 2014 ConferenceGrid Dynamics
 
DevOps World | Jenkins World 2018 and The Future of Jenkins
DevOps World | Jenkins World 2018 and The Future of JenkinsDevOps World | Jenkins World 2018 and The Future of Jenkins
DevOps World | Jenkins World 2018 and The Future of JenkinsNigel Charman
 
Modern Web-site Development Pipeline
Modern Web-site Development PipelineModern Web-site Development Pipeline
Modern Web-site Development PipelineGlobalLogic Ukraine
 
InSpec For DevOpsDays Amsterdam 2017
InSpec For DevOpsDays Amsterdam 2017InSpec For DevOpsDays Amsterdam 2017
InSpec For DevOpsDays Amsterdam 2017Mandi Walls
 
Ship It ! with Ruby/ Rails Ecosystem
Ship It ! with Ruby/ Rails EcosystemShip It ! with Ruby/ Rails Ecosystem
Ship It ! with Ruby/ Rails EcosystemYi-Ting Cheng
 
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony Apps
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony AppsSymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony Apps
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony AppsPablo Godel
 
Using Databases and Containers From Development to Deployment
Using Databases and Containers  From Development to DeploymentUsing Databases and Containers  From Development to Deployment
Using Databases and Containers From Development to DeploymentAerospike, Inc.
 

Similar to CI/CD for Data – Building Dev/Test Data Environments With lakeFS (20)

Rock Solid Deployment of Web Applications
Rock Solid Deployment of Web ApplicationsRock Solid Deployment of Web Applications
Rock Solid Deployment of Web Applications
 
Getting to Walk with DevOps
Getting to Walk with DevOpsGetting to Walk with DevOps
Getting to Walk with DevOps
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
 
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsBig Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
 
Build automation best practices
Build automation best practicesBuild automation best practices
Build automation best practices
 
The Hard Problems of Continuous Deployment
The Hard Problems of Continuous DeploymentThe Hard Problems of Continuous Deployment
The Hard Problems of Continuous Deployment
 
Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e...
 Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e... Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e...
Cloud-Native .Net des applications containerisées .Net sur Linux, Windows e...
 
Thick Application Penetration Testing - A Crash Course
Thick Application Penetration Testing - A Crash CourseThick Application Penetration Testing - A Crash Course
Thick Application Penetration Testing - A Crash Course
 
A Byte of Software Deployment
A Byte of Software DeploymentA Byte of Software Deployment
A Byte of Software Deployment
 
CI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure DatabricksCI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure Databricks
 
Continuous Load Testing with CloudTest and Jenkins
Continuous Load Testing with CloudTest and JenkinsContinuous Load Testing with CloudTest and Jenkins
Continuous Load Testing with CloudTest and Jenkins
 
Саша Белецкий "Continuous Delivery в продуктовой разработке"
Саша Белецкий "Continuous Delivery в продуктовой разработке"Саша Белецкий "Continuous Delivery в продуктовой разработке"
Саша Белецкий "Continuous Delivery в продуктовой разработке"
 
DevOps for Big Data - Data 360 2014 Conference
DevOps for Big Data - Data 360 2014 ConferenceDevOps for Big Data - Data 360 2014 Conference
DevOps for Big Data - Data 360 2014 Conference
 
DevOps World | Jenkins World 2018 and The Future of Jenkins
DevOps World | Jenkins World 2018 and The Future of JenkinsDevOps World | Jenkins World 2018 and The Future of Jenkins
DevOps World | Jenkins World 2018 and The Future of Jenkins
 
Modern Web-site Development Pipeline
Modern Web-site Development PipelineModern Web-site Development Pipeline
Modern Web-site Development Pipeline
 
SD Times - Docker v2
SD Times - Docker v2SD Times - Docker v2
SD Times - Docker v2
 
InSpec For DevOpsDays Amsterdam 2017
InSpec For DevOpsDays Amsterdam 2017InSpec For DevOpsDays Amsterdam 2017
InSpec For DevOpsDays Amsterdam 2017
 
Ship It ! with Ruby/ Rails Ecosystem
Ship It ! with Ruby/ Rails EcosystemShip It ! with Ruby/ Rails Ecosystem
Ship It ! with Ruby/ Rails Ecosystem
 
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony Apps
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony AppsSymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony Apps
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony Apps
 
Using Databases and Containers From Development to Deployment
Using Databases and Containers  From Development to DeploymentUsing Databases and Containers  From Development to Deployment
Using Databases and Containers From Development to Deployment
 

More from ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptxScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDBScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesScyllaDB
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsScyllaDB
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101ScyllaDB
 

More from ScyllaDB (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 
ScyllaDB Virtual Workshop
ScyllaDB Virtual WorkshopScyllaDB Virtual Workshop
ScyllaDB Virtual Workshop
 
DBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & TradeoffsDBaaS in the Real World: Risks, Rewards & Tradeoffs
DBaaS in the Real World: Risks, Rewards & Tradeoffs
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 
NoSQL Data Modeling 101
NoSQL Data Modeling 101NoSQL Data Modeling 101
NoSQL Data Modeling 101
 

Recently uploaded

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Recently uploaded (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

CI/CD for Data – Building Dev/Test Data Environments With lakeFS

  • 1. CI/CD for Data – Building Dev/Test Data Environments With lakeFS Vino SD, Developer Advocate
  • 2. About Me ■ I am Vino Duraisamy. ■ SWE -> Data/ML Engineer -> Developer Advocate ■ Open-source contributor @lakeFS @vinodhini_sd @vinodhini-sd vinodhini-sd.medium.com
  • 3. A Day in the Life of a Data Engineer! Expectation Reality
  • 4. What Can Go Wrong? ■ EMR/Spark upgrade ■ Schema changes ■ Change in business logic ■ Infrastructure changes ■ Troubleshooting failed spark jobs ■ Backfilling historic data ■ Non-idempotent pipelines ■ Maintaining legacy DAGs/pipelines
  • 5.
  • 6. ■ Unit Testing ■ Integration Testing ■ E2E Testing ■ Realistic Input Data ■ Mock data ■ Sampled production data ■ Copy all of Production data Effective Testing Strategy
  • 7. CI/CD for Data– Apply engineering best practices to data engineering
  • 8. Build ■ Place all data assets under data version control Test ■ Create isolated data environments on-demand ■ Automate all your tests Deploy ■ Data quality checks passed ■ Automatic rollbacks are in place CI/CD Best Practices
  • 9. Build ■ Place all data assets under data version control Test (CI) ■ Create isolated data environments on-demand ■ Automate all your tests Deploy ■ Data quality checks passed ■ Automatic rollbacks are in place CI/CD Best Practices
  • 10. Build ■ Place all data assets under data version control Test (CI) ■ Create isolated data environments on-demand ■ Automate all your tests Deploy (CD) ■ Data quality checks passed ■ Automatic rollbacks are in place CI/CD Best Practices
  • 12. Data Repo Branches of the whole repository Data Versioning at the object level Place All Data Assets Under Version Control
  • 13. Git for Data with lakeFS s3://data-repo/collections/foo lakefs://data-repo/main/collections/foo lakectl branch create lakefs://repo/experiment-1 --source lakefs://repo/main # output: # created branch 'experiment-1', # pointing to commit ID: 'd1e9adc71c10a’
  • 16. Ingestion Experimentation Safe CI: Create Isolated Data Environments On-Demand
  • 17. Automated Tests Using lakeFS Hooks
  • 20. ■ Conceptually similar to Github actions ■ Hooks run as a standalone webserver - performs tests when triggered ■ Run custom tests or use test suites like Great Expectations, etc., Automating Data Quality Checks Using lakeFS Hooks
  • 21. ■ post_merge event: Failed Hook runs can programmatically revert the changes to a branch. ■ Reliability of data in Prod. $lakectl revert main^1 CD: Automatic Rollbacks in Place
  • 22. Easily revert corrupted data. Time travel to debug a snapshot Safely test your pipelines on an isolated branch from production Production identical isolated branch and hooks to test your data Data Lakes with lakeFS
  • 23. .io