Chaos Engineering and How to Manage Data Stages With Adi Polak | Current 2022


This document discusses applying chaos engineering principles to complex data systems. It recommends defining a steady state, acknowledging real-world events, running manual experiments in production, and automating production experiments. This helps discover weaknesses and manage the data lifecycle through stages like development, experimentation, deployment, and production. It also presents the idea of using a "Git for Data" system like lakeFS to version data and enable features like branching, rolling back, and atomic updates to more easily develop, test, and recover from errors in data pipelines.


“Don’t make the assumption that things in Spark just work, there is a good chance that Spark underneath the hood is going to do something unexpected”
Holden Karau, Apache Spark PMC
BeeScala, 2017
@adipolak

What can go wrong with Data flows?
EVERYTHING
● new component logic
● new data source
● introducing incompatible schema change
● Kafka broker failed to validate record
● Spark job runs twice, in parallel
● changing tables’ relationship keys
● accidentally delete yesterday’s `events/` partition
● data duplication

Testing is hard!
Distributed data systems with many moving parts
• Unit testing
• Integration testing
• System testing
• E2E …

Chaos Engineering in the World of Large-Scale Complex Data Flow
Adi Polak
Treeverse

Chaos Engineering
4 Principles:
• Defining a steady-state
• Acknowledging a variety of real-world events
• Running manual experiments in production
• Automating production experiments

#1 - Steady state
DevOps / SRE
• system’s throughput
• error rates
• latency percentiles
• etc
Data Engineer
• Data products requirements

Data Product Requirements
• Data Quality
• Accuracy
• No duplications
• SLA
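A minimal sketch of how these requirements could be turned into a checkable steady state. The record layout (`id`, `amount`), the non-negative-amount accuracy rule, and the names `SteadyStateReport` and `check_steady_state` are assumptions made for illustration; they are not from the talk.

```python
from dataclasses import dataclass


@dataclass
class SteadyStateReport:
    duplicate_ids: list
    invalid_rows: list
    sla_met: bool

    @property
    def healthy(self):
        # The data product is in steady state only if every requirement holds.
        return not self.duplicate_ids and not self.invalid_rows and self.sla_met


def check_steady_state(rows, runtime_seconds, sla_seconds):
    """Evaluate one batch against hypothetical data product requirements."""
    seen, duplicates, invalid = set(), [], []
    for row in rows:
        # No duplications: each id may appear only once.
        if row["id"] in seen:
            duplicates.append(row["id"])
        seen.add(row["id"])
        # Accuracy (illustrative rule): amounts must be non-negative numbers.
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            invalid.append(row["id"])
    # SLA: the batch must be produced within the agreed time budget.
    return SteadyStateReport(duplicates, invalid, runtime_seconds <= sla_seconds)


batch = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.5},
    {"id": 2, "amount": 25.5},  # duplicated record
    {"id": 3, "amount": None},  # corrupted amount
]
report = check_steady_state(batch, runtime_seconds=42, sla_seconds=60)
print(report.healthy, report.duplicate_ids, report.invalid_rows)
```

A report like this gives an experiment something concrete to compare before and after a fault is introduced.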

In 3 lines of code..
# Assumes an active SparkSession named `spark`
# Read data from file
df = spark.read.parquet('s3a://bank_transactions/ts=123123123/')
# Some transformation over the data related to finance
updated_df = df  # ... do some magic!
# Appending to the same path that was read leaves both old and new rows in place
updated_df.write.mode('append').parquet('s3a://bank_transactions/ts=123123123/')
Data Duplication
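The duplication can be reproduced without a cluster. The sketch below is a plain-Python simulation, not Spark: a dictionary stands in for object storage, and `run_job` is a hypothetical helper that mimics reading a partition, transforming it, and writing it back in either append or overwrite mode.

```python
def run_job(storage, path, mode):
    """Mimic the slide's job: read a partition, transform it, write it back."""
    rows = list(storage.get(path, []))
    # Stand-in for the elided transformation.
    updated = [dict(r, processed=True) for r in rows]
    if mode == "append":
        # Existing rows stay in place and the transformed rows are added.
        storage[path] = rows + updated
    elif mode == "overwrite":
        # The partition is replaced as a whole.
        storage[path] = updated
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return len(storage[path])


path = "s3a://bank_transactions/ts=123123123/"
seed = [{"id": 1}, {"id": 2}]

appended = {path: list(seed)}
append_counts = [run_job(appended, path, "append") for _ in range(2)]

overwritten = {path: list(seed)}
overwrite_counts = [run_job(overwritten, path, "overwrite") for _ in range(2)]

print(append_counts)     # [4, 8]
print(overwrite_counts)  # [2, 2]
```

Each appending run doubles the partition, while the replacing run is idempotent. In a real Spark job, the replacement would usually be written to a separate output location rather than over the path being read.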

#2 - Vary Real-world Events
DevOps / SRE
hardware failures like:
• servers dying
• spike in traffic
Data Engineer
• schema change
• corrupted data record
• data variance
• accidentally delete yesterday’s `events/` partition
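Two of the data-side events listed here, a schema change and a corrupted record, can be sketched as small fault injectors. The helper names (`drop_field`, `corrupt_field`, `inject_faults`) and the record shape are hypothetical, chosen only to illustrate the idea.

```python
import copy
import random


def drop_field(record, field):
    """Schema change: a column disappears from the record."""
    mutated = copy.deepcopy(record)
    mutated.pop(field, None)
    return mutated


def corrupt_field(record, field):
    """Corrupted record: a numeric field arrives as an unparsable string."""
    mutated = copy.deepcopy(record)
    mutated[field] = "###"
    return mutated


def inject_faults(records, fault, rate, seed=0):
    """Apply `fault` to roughly `rate` of the records, reproducibly."""
    rng = random.Random(seed)
    # Untouched records are copied too, so the input list is never mutated.
    return [fault(r) if rng.random() < rate else copy.deepcopy(r) for r in records]


clean = [{"id": i, "amount": float(i)} for i in range(100)]
no_schema = inject_faults(clean, lambda r: drop_field(r, "amount"), rate=1.0)
corrupted = inject_faults(clean, lambda r: corrupt_field(r, "amount"), rate=0.1)

print(sum("amount" not in r for r in no_schema))
print(sum(r["amount"] == "###" for r in corrupted))
```

A fixed seed keeps each experiment repeatable, which makes a failure easier to troubleshoot afterwards.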

#3 – Run Experiments in Production
DevOps / SRE
• Production machines and network
Data Engineer
• Production data 🙀

Experimental environment
On production data in production ..
s3://data-repo/collections/foo
s3://data-repo/exp1/collections/foo
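The two paths on this slide suggest a naming convention in which an experiment prefix is placed directly after the bucket. A small sketch of that mapping follows; the function name `experiment_path` is hypothetical, and the convention itself is inferred from the slide's example rather than stated in it.

```python
from urllib.parse import urlsplit


def experiment_path(production_uri, experiment):
    """Map a production object-store URI to its location under an experiment prefix."""
    parts = urlsplit(production_uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected a URI like s3://bucket/key, got {production_uri!r}")
    key = parts.path.lstrip("/")
    # Keep the scheme and bucket, then insert the experiment name before the key.
    return f"{parts.scheme}://{parts.netloc}/{experiment}/{key}"


print(experiment_path("s3://data-repo/collections/foo", "exp1"))
# prints: s3://data-repo/exp1/collections/foo
```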

#4 – Automate Experiments to Run Continuously in Prod
DevOps / SRE
• Discover weaknesses
Data Engineer
• Manage Data lifecycle in stages
& Minimize Blast Radius
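One way to picture automated experiments with a small blast radius is a loop that gives every experiment its own copy of the data and records whether the steady state still holds. Everything below is an illustrative assumption: `dedupe_pipeline` is a made-up pipeline under test, and `steady_state` and `run_experiments` are hypothetical helpers.

```python
import copy


def dedupe_pipeline(rows):
    """Hypothetical pipeline under test: keep the first row seen for each id."""
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out


def steady_state(rows):
    """Steady state for this sketch: no duplicate ids and every row has an amount."""
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids)) and all("amount" in r for r in rows)


def run_experiments(baseline, experiments, pipeline, check):
    """Run each experiment on its own copy of the data and record pass/fail."""
    results = {}
    for name, inject in experiments.items():
        isolated = copy.deepcopy(baseline)  # blast radius: never touch the baseline
        try:
            results[name] = check(pipeline(inject(isolated)))
        except Exception:
            results[name] = False  # a crash also counts as leaving steady state
    return results


baseline = [{"id": 1, "amount": 5.0}, {"id": 2, "amount": 7.5}]
experiments = {
    "duplicate_rows": lambda rows: rows + rows,
    "missing_amount": lambda rows: [{"id": r["id"]} for r in rows],
}
results = run_experiments(baseline, experiments, dedupe_pipeline, steady_state)
print(results)
```

Here the pipeline absorbs duplicated input but does not recover from a missing column, which is the kind of weakness such a loop is meant to surface.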

Data product stages & requirements
Stages: Development, Experimentation, Deployment, Production
Diagram labels: Debug, Collaborate, Version control, Best Practices & Data Quality, Troubleshoot, Roll Back

What's the best way to automate data stages propagation in production?

Branching Strategy
Like source control

Git for Data
Challenge
• Quickly recover from an error
• Develop in Isolation
• Troubleshoot
• Atomic Update
• Reprocess
Solution
spark.read.parquet('s3://my-repo/<commit_id>')
$ lakectl revert main^1
$ lakectl branch create my-branch
$ lakectl merge my-branch main
$ lakectl branch create new-logic
$ lakectl merge new-logic main
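To make the branch, merge, and revert semantics concrete, here is a toy in-memory model. It is not the lakeFS API or how lakeFS is implemented; `ToyRepo` and its methods are hypothetical, and the merge and revert rules are deliberately simplified. For the real command arguments, the lakectl documentation is the reference.

```python
class ToyRepo:
    """In-memory stand-in for a versioned data repository (not the lakeFS API)."""

    def __init__(self):
        self.commits = {0: {}}       # commit id -> snapshot {path: data}
        self.parents = {0: None}     # commit id -> parent commit id
        self.branches = {"main": 0}  # branch name -> head commit id

    def _commit(self, branch, snapshot):
        commit_id = len(self.commits)
        self.commits[commit_id] = snapshot
        self.parents[commit_id] = self.branches[branch]
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name, source="main"):
        # Only a pointer is created; no data is copied.
        self.branches[name] = self.branches[source]

    def write(self, branch, path, data):
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot[path] = data
        return self._commit(branch, snapshot)

    def read(self, ref, path):
        # `ref` may be a branch name or a commit id.
        commit_id = self.branches.get(ref, ref)
        return self.commits[commit_id].get(path)

    def merge(self, source, target="main"):
        # Simplification: source paths win, and all changes land in one commit.
        snapshot = dict(self.commits[self.branches[target]])
        snapshot.update(self.commits[self.branches[source]])
        return self._commit(target, snapshot)

    def revert(self, branch):
        # Simplification: restore the parent snapshot as a new commit.
        parent = self.parents[self.branches[branch]]
        if parent is None:
            raise ValueError("nothing to revert")
        return self._commit(branch, dict(self.commits[parent]))


repo = ToyRepo()
repo.write("main", "events/", ["e1"])
repo.create_branch("new-logic")                   # develop in isolation
repo.write("new-logic", "events/", ["e1", "e2"])
repo.write("new-logic", "totals/", [2])
before_merge = repo.read("main", "events/")       # main is unaffected so far
merged = repo.merge("new-logic")                  # both paths appear together
after_merge = (repo.read("main", "events/"), repo.read("main", "totals/"))
repo.revert("main")                               # quickly recover from an error
after_revert = (repo.read("main", "events/"), repo.read("main", "totals/"))
at_commit = repo.read(merged, "events/")          # troubleshoot a past state
print(before_merge, after_merge, after_revert, at_commit)
```

Because a merge publishes every changed path in a single commit, readers of `main` see either none of the new data or all of it, which is the atomic-update property named on the slide.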

How Does lakeFS Work?

What Does A Typical Environment Look Like?

Recap
• Adopting Principles of Chaos Engineering to data systems
• Data Lifecycle Management stages
• Git for Data
• lakeFS
