
Schema Evolution Patterns - Texas Scalability Summit 2019

This is an updated version of the talk I gave at Velocity SJ 2019. This talk is about schema evolution - what happens when the structure of your structured data changes. We look at why schema evolution is complicated, how the notion of schema compatibility helps manage that complexity, and how to do data migrations in those cases where compatibility isn't an option. The talk is a mix of theoretical principles and examples of how those principles are applied in practice at some of the world's largest tech companies.

  1. 1. Schema Evolution Patterns. Alex Rasmussen, alex@bitsondisk.com. Image: John Gould (14.Sep.1804 - 3.Feb.1881) [Public domain]
  2. 2. Hi, I’m Alex! LA-based Data Engineering Consultant. https://www.bitsondisk.com/ Twitter/GitHub/LinkedIn/…: @alexras
  3. 3. A Deceptively Hard Problem •Classic three-tier web service - Multiple servers for scalability - Rolling updates for high availability - API for extensibility •How do we make changes to data? - Let’s focus on one table, people [Diagram labels: Users, API Clients, LB / API Gateway, four App servers, DB]
  4. 4. Deceptively Hard Problem #1 •Add administrative users - Need to add is_admin to people table - … but clients with the old schema will fail to write if they don’t provide is_admin! Won’t they?
  5. 5. Deceptively Hard Problem #2 •Splitting name into first_name, last_name - Old clients will keep writing to name - New clients will expect first_name and last_name to be defined in old data •How do we do this update safely?
  6. 6. Schema Evolution •When your data’s shape (its schema) changes •Why is this hard? - Schemas can’t change everywhere instantly - Client code can be very difficult to update - If client and data schemas don’t agree, it can cause serious problems
  7. 7. How Do We Handle This? •We need to give the illusion of instant schema change to clients, with minimal code change. •In this talk, we’ll look at how.
  8. 8. Goal of This Talk •Broadly-applicable concepts, techniques, and patterns for schema evolution - Schema compatibility for transparent schema change - Data migration when compatibility isn’t possible or practical •How this looks in practice
  9. 9. 1. SCHEMA COMPATIBILITY 2. DATA MIGRATION a. SINGLE SCHEMA b. MULTI-SCHEMA 3. TAKEAWAYS
  10. 10. 1. SCHEMA COMPATIBILITY 2. DATA MIGRATION a. SINGLE SCHEMA b. MULTI-SCHEMA 3. TAKEAWAYS
  11. 11. The Illusion of Instant Change •Instant schema change everywhere isn’t possible, but we want to give the illusion that it is - Goal #1: Clients can still read and write safely, even if their schemas are different - Goal #2: Code change to clients is minimized •Schema compatibility makes this easier to do
  12. 12. Schema Compatibility •If two schemas are compatible, evolving from one schema to another can be done automatically on read •Clients can be oblivious to schema change •Two directions: backwards and forwards
  13. 13. Compatibility •Backwards-Compatibility: data written with old schema (X) readable by clients with new schema (X+1) •Forwards-Compatibility: data written with new schema (X+1) readable by clients with old schema (X)
  14. 14. Add a Field With a Default •Schema X+1: name: string, age: integer, is_admin: boolean (default: false) •Record written with X: name: “Bob Jones”, age: 42 •Record written with X+1: name: “Tom Peters”, age: 32, is_admin: false •Backwards: reading the X record, client CX+1 adds is_admin = false •Forwards: reading the X+1 record, client CX ignores is_admin
  15. 15. Remove a Field With a Default •Schema X: name: string, age: integer, is_admin: boolean (default: false), pto_days_left: integer (default: 0) •Record written with X: name: “Alice Smith”, age: 29, is_admin: true, pto_days_left: 16 •Record written with X+1: name: “Carol Danvers”, age: 34, is_admin: true •Backwards: reading the X record, client CX+1 ignores pto_days_left •Forwards: reading the X+1 record, client CX adds pto_days_left = 0
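
The two slides above describe resolving schema differences on read: a reader fills in defaults for fields the record lacks and ignores fields it does not know. The following is a minimal Python sketch of that idea, not code from the talk. The schema representation (a dict of field name to default), the `REQUIRED` marker, and the function `read_with_schema` are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical schema representation: field name -> default value.
# REQUIRED marks a field that has no default.
REQUIRED = object()

PEOPLE_V1 = {"name": REQUIRED, "age": REQUIRED}
PEOPLE_V2 = {"name": REQUIRED, "age": REQUIRED, "is_admin": False}
PEOPLE_WITH_PTO = {**PEOPLE_V2, "pto_days_left": 0}


def read_with_schema(record: dict[str, Any], reader_schema: dict[str, Any]) -> dict[str, Any]:
    """Project a stored record onto the reader's schema."""
    result = {}
    for field, default in reader_schema.items():
        if field in record:
            result[field] = record[field]      # field present: use stored value
        elif default is not REQUIRED:
            result[field] = default            # field missing: fall back to default
        else:
            raise KeyError(f"record is missing required field {field!r}")
    # Fields in the record that the reader does not know are simply dropped.
    return result


bob = {"name": "Bob Jones", "age": 42}                        # written with V1
tom = {"name": "Tom Peters", "age": 32, "is_admin": False}    # written with V2
alice = {"name": "Alice Smith", "age": 29, "is_admin": True, "pto_days_left": 16}

print(read_with_schema(bob, PEOPLE_V2))      # newer reader adds is_admin
print(read_with_schema(tom, PEOPLE_V1))      # older reader ignores is_admin
print(read_with_schema(alice, PEOPLE_V2))    # reader without pto_days_left ignores it
```

Because every added or removed field carries a default, neither reader needs to know which schema a record was written with.
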
  16. 16. Other Types of Changes •Without defaults: - Adding a field breaks backwards-compatibility (in older data, field value is undefined) - Removing a field breaks forwards-compatibility (for older clients, field value is undefined) •Renaming (e.g. ssn to social_security_number): it depends
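
The rules on the slide above can be written down as a small check. This is a sketch under simplifying assumptions: schemas are dicts of field name to default, `REQUIRED` is a hypothetical marker for "no default", and only field additions and removals are considered (type changes and renames are out of scope).

```python
# Hypothetical marker for a field without a default.
REQUIRED = object()


def compatibility(old: dict, new: dict) -> tuple[bool, bool]:
    """Return (backwards, forwards) for a change from `old` to `new`."""
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    # Backwards: new readers must read old data, so every added field
    # needs a default to stand in for the missing value.
    backwards = all(new[field] is not REQUIRED for field in added)
    # Forwards: old readers must read new data, so every removed field
    # needs a default in the old schema.
    forwards = all(old[field] is not REQUIRED for field in removed)
    return backwards, forwards


base = {"name": REQUIRED, "age": REQUIRED}
with_default = {**base, "is_admin": False}
without_default = {**base, "is_admin": REQUIRED}

print(compatibility(base, with_default))       # add a field with a default
print(compatibility(base, without_default))    # add a field without a default
print(compatibility(without_default, base))    # remove a field without a default
```
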
  17. 17. In Practice: API Design •So far, focused on DBs •Compatibility is especially important for APIs - Lots of clients you might not control - API version bumps need to happen when incompatible schema changes happen
  18. 18. In Practice - Protocol Buffers message Person { required string name = 1; required int32 age = 2; optional bool is_admin = 3 [default = false]; } •Field numbers make renames compatible •In version 3, no required or optional - required broke backwards-compatibility too often
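
To see why field numbers make renames compatible, consider a toy encoding that keys values by number rather than by name. This is an illustrative Python sketch only: it is not the Protocol Buffers wire format or API, and `encode`, `decode`, and both field maps are hypothetical.

```python
# Hypothetical field maps: field name -> field number.
# The reader has renamed ssn to social_security_number but kept number 3.
WRITER_FIELDS = {"name": 1, "age": 2, "ssn": 3}
READER_FIELDS = {"name": 1, "age": 2, "social_security_number": 3}


def encode(record: dict, fields: dict) -> dict:
    """Replace field names with field numbers before storing or sending."""
    return {fields[name]: value for name, value in record.items()}


def decode(payload: dict, fields: dict) -> dict:
    """Map field numbers back to the reader's names; skip unknown numbers."""
    names = {number: name for name, number in fields.items()}
    return {names[number]: value for number, value in payload.items() if number in names}


payload = encode({"name": "Bob Jones", "age": 42, "ssn": "000-00-0000"}, WRITER_FIELDS)
print(payload)                          # names never appear in the payload
print(decode(payload, READER_FIELDS))   # the renamed field still resolves
```

Since the name never reaches the payload, a rename is invisible to other clients as long as the number stays the same.
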
  19. 19. In Practice: Stripe •Goal: API responses readable by all old clients w/o code change. •API server has latest schema, but clients keep schema forever •Solution: Version change modules applied in reverse order from server’s version to client’s version (they admit: this is hard) [Diagram: versions 1, 2, 3 between server S and client C, with conversions 3to2() and 2to1()]
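
The idea of applying version changes in reverse order can be sketched as a chain of downgrade functions. The code below is a hypothetical illustration in Python, not Stripe's implementation. The three versions, the field names (borrowed from the people table used earlier in the talk), and `render_for_client` are assumptions.

```python
LATEST_VERSION = 3


def v3_to_v2(response: dict) -> dict:
    """Assumed change in v3: is_admin was added, so strip it for v2 clients."""
    response = dict(response)
    response.pop("is_admin", None)
    return response


def v2_to_v1(response: dict) -> dict:
    """Assumed change in v2: name was split, so rejoin it for v1 clients."""
    response = dict(response)
    first = response.pop("first_name")
    last = response.pop("last_name")
    response["name"] = f"{first} {last}"
    return response


# Each entry downgrades a response from that version to the one before it.
DOWNGRADES = {3: v3_to_v2, 2: v2_to_v1}


def render_for_client(response: dict, client_version: int) -> dict:
    """Walk from the server's version down to the client's pinned version."""
    for version in range(LATEST_VERSION, client_version, -1):
        response = DOWNGRADES[version](response)
    return response


latest = {"first_name": "Alice", "last_name": "Smith", "age": 29, "is_admin": True}
print(render_for_client(latest, 1))
print(render_for_client(latest, 2))
```

The server only ever builds the latest response. Each new version adds one downgrade step, so old clients keep working without code changes on their side.
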
  20. 20. Recap •Compatibility allows for transparent movement between schemas •Changes can be backwards-compatible, forwards-compatible, both, or neither •Ease-of-compatibility drives the design of many messaging formats
  21. 21. 1. SCHEMA COMPATIBILITY 2. DATA MIGRATION a. SINGLE SCHEMA b. MULTI-SCHEMA 3. TAKEAWAYS
  22. 22. Crossing Compatibility Gaps •Need a plan for when compatibility isn’t an option - Not all schema changes are compatible - Not all incompatibilities are simple - Not all compatible changes are practical
  23. 23. Complex Changes •Schema change: name: string (removed), first_name: string (added), last_name: string (added), age: integer, is_admin: boolean (default: false) •Not obvious how to split; code changes required •Two field additions without defaults: not backwards-compatible •Field removal without default: not forwards-compatible
  24. 24. Impractical Changes •e.g. Adding a column in MySQL (<v8) requires locking/copying the table - Days to weeks not unheard of for tables with millions of rows
  25. 25. Crossing Compatibility Gaps •Compatibility gaps are crossed with data migrations: minimally disruptive movement between schemas •We’ll look at: - Single-schema stores (e.g. RDBMS) - Multi-schema stores (e.g. MongoDB, Kafka)
  26. 26. 1. SCHEMA COMPATIBILITY 2. DATA MIGRATION a. SINGLE SCHEMA b. MULTI-SCHEMA 3. TAKEAWAYS
  27. 27. Three-Tier Web Architecture [Diagram: clients C1-C4 behind a load balancer, store S] •Example rows, name to first_name / last_name: “Bob Jones” to “Bob” / “Jones”; “Alice Smith” to “Alice” / “Smith”; “Jamie Lee Curtis” to “Jamie Lee” / “Curtis”
  28. 28. Single-Schema Migration •Move from X to (incompatible) X + 1 without downtime [Diagram: clients C1-C4, store S, schemas X and X+1]
  29. 29. Step 1: Create and migrate temporary store S’ [Diagram: clients C1-C4, store S on schema X, store S’ on schema X+1]
  30. 30. Step 2: Create a copier (C) and an updater (U)
  31. 31. Step 3.1: Move clients over to new schema
  32. 32. Step 3.2: Copy data, record / apply updates
  33. 33. Step 4: Cutover - S’ becomes S (the old S becomes Sold)
  34. 34. Step 5: Drain updater, delete Sold
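
The copier and updater roles from the steps above can be sketched with in-memory dicts standing in for S and S’. This is a simplified Python illustration, not how any particular tool works. The class `OnlineMigration`, its method names, and the naive split on the last space are assumptions. Deletes and client routing are not modeled, and the updater is drained inside `cutover` rather than as a separate final step.

```python
from collections import deque


def to_new_schema(row: dict) -> dict:
    """Naive X -> X+1 transform: split name on its last space (an assumption)."""
    first, _, last = row["name"].rpartition(" ")
    rest = {key: value for key, value in row.items() if key != "name"}
    return {"first_name": first, "last_name": last, **rest}


class OnlineMigration:
    def __init__(self, store: dict):
        self.store = store        # S: key -> row in schema X
        self.shadow = {}          # S': key -> row in schema X+1
        self.changed = deque()    # keys written to S since the copy began

    def write(self, key, row: dict) -> None:
        """Client write against S during the migration (schema X)."""
        self.store[key] = row
        self.changed.append(key)  # recorded so the updater can replay it

    def copy_chunk(self, keys) -> None:
        """Copier: transform one chunk of existing rows into S'."""
        for key in keys:
            self.shadow[key] = to_new_schema(self.store[key])

    def apply_updates(self) -> None:
        """Updater: replay recorded writes onto S'."""
        while self.changed:
            key = self.changed.popleft()
            self.shadow[key] = to_new_schema(self.store[key])

    def cutover(self) -> dict:
        """Drain remaining updates, then S' takes over as S."""
        self.apply_updates()
        self.store, self.shadow = self.shadow, None
        return self.store


migration = OnlineMigration({1: {"name": "Bob Jones"}, 2: {"name": "Alice Smith"}})
migration.copy_chunk([1])
migration.write(3, {"name": "Jamie Lee Curtis"})   # arrives while the copy is running
migration.copy_chunk([2])
people = migration.cutover()
print(people)
```

The write that arrives mid-copy is not lost: it is recorded when it happens and replayed onto the new store before the swap.
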
  35. 35. In Practice - Percona • pt-online-schema-change - Copier: scan/copy in timed chunks - Updater: synchronous table triggers - Cutover: RENAME TABLE
  36. 36. In Practice - GitHub •gh-ost - Copier: chunked reads/writes - Updater: read binlog, interleave copies - Cutover: 2-step blocking swap
  37. 37. Recap •In single-schema stores: - Migrate clients gradually, maintaining the illusion of the old schema to old clients - Migrate data to new schema over time, applying updates to old and new copies - When migration complete, then cut over
  38. 38. 1. SCHEMA COMPATIBILITY 2. DATA MIGRATION a. SINGLE SCHEMA b. MULTI-SCHEMA 3. TAKEAWAYS
  39. 39. Multi-Schema Stores •Data with different schemas coexisting in the same store •MongoDB: collections of documents •Kafka: topics of messages •Want illusion of single schema •Example documents: { "name": "Alice Smith", "age": 29, "organization": "Engineering" } { "name": "Bob Jones", "age": 42 } { "name": "Carol Danvers", "age": 34, "organization": "Security" }
  40. 40. Multi-Schema Migration •Move data from schema X to (backwards-incompatible) schema X + 1 without blocking clients [Diagram: clients C1-C3, schemas X and X+1]
  41. 41. Step 1: Old clients write with new schema, continue reading with old schema (old clients are still compatible!)
  42. 42. Step 2: Migrate old data to new schema
  43. 43. Step 3: Old clients read and write with new schema
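
The three steps above can be walked through on a small in-memory collection. This Python sketch assumes the backwards-incompatible change is adding an `organization` field with no default, as in the example documents on the Multi-Schema Stores slide. The readers, the `directory` lookup used for the backfill, and the value "Sales" are hypothetical.

```python
def read_x(doc: dict) -> dict:
    """Reader on schema X: ignores fields it does not know."""
    return {"name": doc["name"], "age": doc["age"]}


def read_x1(doc: dict) -> dict:
    """Reader on schema X+1: organization is required, with no default."""
    return {"name": doc["name"], "age": doc["age"], "organization": doc["organization"]}


def count_unreadable(docs: list, reader) -> int:
    """Count documents the given reader cannot parse."""
    unreadable = 0
    for doc in docs:
        try:
            reader(doc)
        except KeyError:
            unreadable += 1
    return unreadable


def backfill(docs: list, directory: dict) -> None:
    """Step 2: rewrite old documents in place so they satisfy schema X+1."""
    for doc in docs:
        if "organization" not in doc:
            doc["organization"] = directory[doc["name"]]


collection = [
    {"name": "Alice Smith", "age": 29, "organization": "Engineering"},
    {"name": "Bob Jones", "age": 42},                       # written with schema X
    {"name": "Carol Danvers", "age": 34, "organization": "Security"},
]

# Step 1: writers switch to X+1 while readers stay on X.
collection.append({"name": "Tom Peters", "age": 32, "organization": "Support"})
old_reader_failures = count_unreadable(collection, read_x)
new_reader_failures_before = count_unreadable(collection, read_x1)

# Step 2: migrate old data (the directory is a made-up source of truth).
backfill(collection, {"Bob Jones": "Sales"})

# Step 3: readers can now switch to X+1.
new_reader_failures_after = count_unreadable(collection, read_x1)
print(old_reader_failures, new_reader_failures_before, new_reader_failures_after)
```

Readers on X never fail because the change is forwards-compatible. Readers on X+1 are only safe once the backfill has removed the last schema-X document.
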
  44. 44. In Practice: Kafka (Confluent) •Schema-aware clients transparently apply compatible changes •Backwards-incompatible changes: update writers first •Forwards-incompatible changes: update readers first
  45. 45. Recap •In multi-schema stores: - Make old clients generate compatible data (by writing or reading with new schema) - Migrate old data to new schema - Old clients read and write with new schema
  46. 46. 1. SCHEMA COMPATIBILITY 2. DATA MIGRATION a. SINGLE SCHEMA b. MULTI-SCHEMA 3. TAKEAWAYS
  47. 47. Summary •Schemas can’t change everywhere instantly •Schema compatibility can transparently provide the illusion of instant change •Data migrations fill in compatibility gaps, carefully keeping clients working
  48. 48. Takeaways •This applies to DB schema changes and API versioning, but it also applies to CSV/JSON/Excel, etc. •If your data has structure, it probably has a schema, & these concepts apply
  49. 49. Takeaways •Reason about schema evolution up-front to guide your architecture choices - Prefer compatible changes •Have a plan for dealing with incompatibility - Present the illusion of instant schema change •Remember: this is a hard problem for everyone!
  50. 50. Thank You! Questions? https://www.bitsondisk.com/ Consulting Inquiries: alex@bitsondisk.com Image: John Gould (14.Sep.1804 - 3.Feb.1881) [Public domain]
