Honey, I Shrunk the Database
For Test and Development Environments
Vanessa Hurst, Paperless Post
Postgres Open, September 2011

Why Shrink?
  Accuracy: You don't truly know how your app will behave in production unless you use real data. Production data is the ultimate in accuracy.
  Freshness: New data should be available regularly. Full database refreshes should be timely.
  Resource Limitations: Staging and developer machines cannot handle production load.
  Data Protection: Limit the spread of sensitive user or client data.

Case Study: Paperless Post
Requirements
  Freshness: daily, and on command for non-developers
  Shrinkage: slices, mutations
Resources
  Source: extra disk space, RAM, and CPUs
  Destination: limited, often entirely un-optimized
  Development: constrained DBA resources

Shrink Strategies
  Copies: restored backups or live replicas of the entire production database
  Slices: select portions of exact data
  Mutations: sanitized, anonymized, or otherwise changed data
  Assumptions: seed databases, fixtures, test data

Slices
Vertical Slice
  Difficult to obtain a valid, useful subset of data.
  Example: include some entire tables, exclude others
Horizontal Slice
  Difficult to write and maintain.
  Example: SQL or application code to determine the subset of data

PG Tools: Vertical Slice
Flexibility at Source (Production): pg_dump
  Include data only: -a, --data-only
  Include table schema only: -s, --schema-only
  Select tables: -t table, --table=table (repeat the switch for each table)
  Select schemas: -n schema, --schema=schema
  Exclude schemas: -N schema, --exclude-schema=schema

PG Tools: Vertical Slice
Flexibility at Destination (Staging, Development): pg_restore
  Include data only: -a, --data-only
  Select indexes: -I index, --index=index
  Tune processing: -j number-of-jobs, --jobs=number-of-jobs
  Select schemas: -n schema, --schema=schema
  Select triggers: -T trigger, --trigger=trigger
  Exclude privileges: -x, --no-privileges, --no-acl

Mutations
External Data Protection
  HIPAA Regulations
  PCI Compliance
  API Terms of Use
Internal Data Protection
  Protecting your users' personal data
  Protecting your users from accidents, e.g. staging emails
  Your Terms of Service

Case Study: Paperless Post
Composite Slice including:
  Vertical Slice: all application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  Vertical Slice: entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  Horizontal Slice: subset of users and their data
  Mutation: changed user email addresses

Case Study: Paperless Post
  CREATE SCHEMA staging;

Case Study: Paperless Post
Horizontal Slice
  Custom SQL
    SELECT * INTO staging.users
    FROM users
    WHERE EXISTS (subset of users);
  Dynamic relative to full data set or newly created slice
    SELECT * INTO staging.stuff
    FROM stuff
    WHERE EXISTS (stuff per staging.users);
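The "stuff per staging.users" condition is left abstract on the slide. A minimal sketch of one way to generate such a statement, assuming each dependent table has a column that references users.id (the helper name slice_child_sql and the column names are hypothetical, not from the talk):

```shell
#!/bin/sh
# Hypothetical helper: print a SELECT INTO that copies only the rows of a
# table that belong to users already present in staging.users.
# Assumption: the table has a column referencing users.id.
slice_child_sql() {
  table=$1      # source table in the default schema, e.g. stuff
  fk_column=$2  # column that references users.id, e.g. user_id
  cat <<SQL
SELECT t.* INTO staging.$table
FROM $table t
WHERE EXISTS (SELECT 1 FROM staging.users u WHERE u.id = t.$fk_column);
SQL
}
```

The output could be piped to psql against the source database, for example slice_child_sql stuff user_id | psql db-01. Because each statement reads from staging.users, the users slice has to exist first.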

Case Study: Paperless Post
Horizontal Slice
  Custom SQL
  Dynamic relative to full data set or newly created slice
Mutations
  Email Addresses: use regular expressions to clean non-admin addresses,
    e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
  Cached Data: clear cached short links from the link-shortening API
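The email mapping in the example can be sketched with a single regular expression. This is an illustration of the mapping only, not the expression used at Paperless Post; the function name mutate_email is hypothetical, and the rule for recognizing admin addresses is not given in the talk, so it is left out:

```shell
#!/bin/sh
# Sketch only: reproduces the slide's example mapping.
# The talk does not say which addresses count as "admin", so no such
# filter appears here.
mutate_email() {
  # Drop every non-alphanumeric character (including "@" and "."),
  # then wrap what is left in the staging+...@paperlesspost.com form.
  cleaned=$(printf '%s' "$1" | sed 's/[^[:alnum:]]//g')
  printf 'staging+%s@paperlesspost.com\n' "$cleaned"
}
```

Calling mutate_email dude@gmail.com prints staging+dudegmailcom@paperlesspost.com. Note that stripping punctuation can map two distinct addresses (a.b@x.com and ab@x.com) to the same result, which matters if the email column carries a unique constraint.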

Case Study: Paperless Post
Composite Slice including:
  Vertical Slice: all application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  Vertical Slice: entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  Horizontal Slice: subset of users and their data
  Mutation: changed user email addresses
    pg_dump --data-only --schema staging db-01 >> slice.sql

Case Study: Paperless Post
Rebuild
  Prepare new database as standby
  Gracefully close connections
  Rotate by renaming databases
Security
  Dedicated database build user
  Membership in application user role
  Application user role and privileges remain
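The rotate-by-renaming step could look roughly like the following. This is a sketch under assumptions: the helper name rotation_sql, the live database name db-staging, and the _old suffix are all hypothetical; only db-new appears in the talk.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that swaps a freshly built database
# into place. Names are double-quoted because identifiers such as db-new
# contain a hyphen.
rotation_sql() {
  live=$1   # name the application connects to
  fresh=$2  # newly built standby database
  cat <<SQL
ALTER DATABASE "$live" RENAME TO "${live}_old";
ALTER DATABASE "$fresh" RENAME TO "$live";
SQL
}
```

A run might be rotation_sql db-staging db-new | psql postgres. PostgreSQL refuses to rename the database of the current session or one that still has other sessions connected, which is why connections are closed first and the commands are issued from a different database.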

Case Study: Paperless Post
Rebuild
  $ bzcat slice.sql.bz2 | psql db-new
  The staging schema has not been created at the destination, so all data loads into the default schema.

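Case Study: Paperless Post
We hacked our rebuild by importing across schemas!
Now our sequences are wrong, causing duplicate-key errors every time we try to insert into tables. The tables in the staging schema were created with SELECT INTO, so they have no sequences of their own and the dump carries no sequence values for them.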

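Secret Weapon
  -- Updates all serial sequences for ID columns only.
  -- The slide shows only the body; the CREATE FUNCTION wrapper, the
  -- DECLARE section, and the void return type are reconstructed here.
  CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
  DECLARE
    table_record RECORD;
    table_name text;
  BEGIN
    -- Every ordinary table that has a column named "id"
    FOR table_record IN
      SELECT pc.relname
      FROM pg_class pc
      WHERE pc.relkind = 'r'
        AND EXISTS (SELECT 1 FROM pg_attribute pa
                    WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
    LOOP
      table_name := table_record.relname::text;
      -- Move the sequence behind "id" up to the current maximum id
      EXECUTE 'SELECT setval(pg_get_serial_sequence('
        || quote_literal(table_name) || ', ' || quote_literal('id')
        || '), MAX(id)) FROM ' || quote_ident(table_name)
        || ' WHERE EXISTS (SELECT 1 FROM ' || quote_ident(table_name) || ')';
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;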

Case Study: Paperless Post
Rebuild
  $ bzcat slice.sql.bz2 | psql db-new
  The staging schema has not been created at the destination, so all data loads into the default schema.
  $ echo "select 1 from update_id_sequences();" >> slice.sql
  Vacuum
  Reindex

Case Study: Paperless Post
Security
  Database build user
    CREATEDB privilege
    Member of application user role
  Application user remains database owner
  Application user privileges remain limited
  Build only works in predetermined environments

Questions?
Vanessa Hurst, Paperless Post, @DBNess
Postgres Open, September 2011

More Tools
Copies: LVM Snapshots
  See talk by Jon Erdman at PG Conf EU
  Great for all reads
  Data stays virtualized and doesn't take up space until changed
  Ideal for DDL changes without actual data changes

More Tools
Copies, Slices: pg_staging by dimitri, http://github.com/dimitri/pg_staging
  Simple: pauses pgbouncer and restores backup
  Efficient: leverages bulk loading
  Flexible: supports varying psql files
  Custom: limited
Slices: replicate by rtomayko of GitHub, http://github.com/rtomayko/replicate
  Simple: preserves object relations via ActiveRecord
  Inefficient: creates text-based .dump
  Inflexible: corrupts id sequences on data insert
  Custom: highly
