Honey I Shrunk the Database
  • I am Vanessa Hurst and I lead Data and Analytics at Paperless Post, a customizable online stationery startup in New York. I studied Computer Science and Systems and Information Engineering at the University of Virginia. I have experience in databases ranging from few-hundred-megabyte CMSes for non-profits to terabytes of financial data and high-traffic consumer websites. I've worked in data processing, product development, and business intelligence. I am a happy open-source convert and lone data wrangler in a land of web developers using Ruby on Rails.
  • Static Data
  • This may include external legal regulations or internal regulations such as terms of service. Data protection can also include mitigating risk or proactively screening before data is even available.
    HIPAA Regulations
    PCI Compliance
    API Terms of Use
  • Any other reasons?
  • Requirements
    Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set
    Mutation -- protect user data in highly personal communications, prevent staging use of customer emails
    Daily Refresh
    Resources
    Source -- production server w/ample space, power, & memory
    Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure
    Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time
  • Quick vocabulary. Backup & restore, trigger-based replication: there are plenty of options that are all straightforward, but they don’t give you a lot of leeway on resources.
  • Most common case
  • If you’re doing Business Intelligence, you need a copy of your production database. Figure it out.
  • Vertical -- difficult to keep data valid & usable -- valid units of space are not always valid in an application
    e.g. WAL logs, Pages 1-16 => smaller, finite size, not usable
    Horizontal -- requires application logic, highly customized but usable
    e.g. Users with ids 1-50, Users who joined before July 4, Users who are admins, any SQL logic
  • http://www.postgresql.org/docs/current/static/app-pgdump.html
    Options to:
    Dump OIDs in case your app uses them
    Leave out ownership commands (if staging environments run as different users)
  • Static Data
  • Dedicated schema preserves all table, index, sequence names, etc
  • Only the build process is staging-specific, all other privileges and settings match production
  • http://github.com/rtomayko/replicate

Honey I Shrunk the Database Presentation Transcript

  • 1. Honey, I Shrunk the Database
    For Test and Development Environments
    Vanessa Hurst
    Paperless Post
    Postgres Open, September 2011
  • 2.
  • 3. User Data
  • 4. Why Shrink?
    Accuracy
    You don’t truly know how your app will behave in production unless you use real data.
    Production data is the ultimate in accuracy.
  • 5. Why Shrink?
    Accuracy
    Freshness
    New data should be available regularly.
    Full database refreshes should be timely.
  • 6. Why Shrink?
    Accuracy
    Freshness
    Resource Limitations
    Staging and developer machines cannot handle production load.
  • 7. Why Shrink?
    Accuracy
    Freshness
    Resource Limitations
    Data Protection
    Limit spread of sensitive user or client data.
  • 8. Why Shrink?
    Accuracy
    Freshness
    Resource Limitations
    Data Protection
  • 9. Case Study: Paperless Post
    Requirements
    Freshness – Daily, On command for non-developers
    Shrinkage – Slices, Mutations
  • 10. Case Study: Paperless Post
    Requirements
    Freshness – Daily, On command for non-developers
    Shrinkage – Slices, Mutations
    Resources
    Source – extra disk space, RAM, and CPUs
    Destination – limited, often entirely un-optimized
    Development -- constrained DBA resources
  • 11. Shrink Strategies
    Copies
    Restored backups or live replicas of entire production database
  • 12. Shrink Strategies
    Copies
    Slices
    Select portions of exact data
  • 13. Shrink Strategies
    Copies
    Slices
    Mutations
    Sanitized, anonymized, or otherwise changed data
  • 14. Shrink Strategies
    Copies
    Slices
    Mutations
    Assumptions
    Seed databases, fixtures, test data
  • 15. Shrink Strategies
    Copies
    Slices
    Mutations
    Assumptions
  • 16. Slices
    Vertical Slice
    Difficult to obtain a valid, useful subset of data.
    Example: Include some entire tables, exclude others
  • 17. Slices
    Vertical Slice
    Difficult to obtain a valid, useful subset of data.
    Example: Include some entire tables, exclude others
    Horizontal Slice
    Difficult to write and maintain.
    Example: SQL or application code to determine subset of data
  • 18. PG Tools – Vertical Slice
    Flexibility at Source (Production)
    pg_dump
    Include data only [-a --data-only]
    Include table schema only [-s --schema-only]
    Select tables [-t table --table=table], repeat the switch for each table
    Select schemas [-n schema --schema=schema]
    Exclude schemas [-N schema --exclude-schema=schema]
  • 19. PG Tools – Vertical Slice
    Flexibility at Destination (Staging, Development)
    pg_restore
    Include data only [-a --data-only]
    Select indexes [-I index --index=index]
    Tune processing [-j number-of-jobs --jobs=number-of-jobs]
    Select schemas [-n schema --schema=schema]
    Select triggers [-T trigger --trigger=trigger]
    Exclude privileges [-x --no-privileges --no-acl]
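A minimal sketch of how the pg_dump and pg_restore switches above might be combined. pg_restore reads archive formats rather than plain SQL, and its --jobs option needs a custom or directory format archive, so this path differs from the plain-SQL dumps used later in the case study. The helper names (run, dump_custom, restore_parallel), the file name slice.dump, and the SLICE_EXECUTE variable are assumptions for illustration; db-01 and db-new are borrowed from the case study slides.

```shell
#!/bin/sh
# Print each command instead of running it unless SLICE_EXECUTE=1 is set.
# (SLICE_EXECUTE is a hypothetical switch for this sketch only.)
run() {
    if [ "${SLICE_EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Vertical slice at the source: one schema, written as a custom-format archive.
# $1 = source database, $2 = archive file
dump_custom() {
    run pg_dump --format=custom --schema=public --file="$2" "$1"
}

# Flexibility at the destination: parallel restore without ownership or
# privilege commands. $1 = target database, $2 = archive file, $3 = jobs
restore_parallel() {
    run pg_restore --jobs="$3" --no-owner --no-privileges --dbname="$1" "$2"
}

dump_custom db-01 slice.dump
restore_parallel db-new slice.dump 4
```

Run as-is, the script only prints the two command lines, which makes it easy to review them before setting SLICE_EXECUTE=1 on a machine that has the PostgreSQL client tools installed.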
  • 20.
  • 21. Mutations
    External Data Protection
    HIPAA Regulations
    PCI Compliance
    API Terms of Use
  • 22. Mutations
    External Data Protection
    HIPAA Regulations
    PCI Compliance
    API Terms of Use
    Internal Data Protection
    Protecting your users’ personal data
    Protecting your users from accidents, e.g. staging emails
    Your Terms of Service
  • 23. User Data
  • 24. Case Study: Paperless Post
    Composite Slice including
    Vertical Slice – All application object schemas
    Vertical Slice – Entire tables of static content
    Horizontal Slice – Subset of users and their data
    Mutation – Changed user email addresses
  • 25. Case Study: Paperless Post
    Composite Slice including
    Vertical Slice – All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  • 26. Case Study: Paperless Post
    Composite Slice including
    Vertical Slice – All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
    Vertical Slice – Entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  • 27. Case Study: Paperless Post
    Composite Slice including
    Vertical Slice – All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
    Vertical Slice – Entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
    Horizontal Slice – Subset of users and their data
    Mutation – Changed user email addresses
  • 28. Case Study: Paperless Post
    CREATE SCHEMA staging;
  • 29. Case Study: Paperless Post
    Horizontal Slice
    Custom SQL
    SELECT * INTO staging.users
    FROM users
    WHERE EXISTS (subset of users);
  • 30. Case Study: Paperless Post
    Horizontal Slice
    Custom SQL
    SELECT * INTO staging.users
    FROM users
    WHERE EXISTS (subset of users);
    Dynamic relative to full data set or newly created slice
    SELECT * INTO staging.stuff
    FROM stuff
    WHERE EXISTS (stuff per staging.users);
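The placeholders in the slide SQL can be made concrete. The sketch below prints one possible pair of statements: the first slice is relative to the full data set, the second is relative to the newly created slice. The column names (users.id, stuff.user_id), the id cutoff, and the helper name horizontal_slice_sql are assumptions for illustration, not the schema or logic used at Paperless Post.

```shell
#!/bin/sh
# Prints SQL for a horizontal slice; pipe the output into psql to apply it.
# Assumptions (not from the talk): users.id is numeric, stuff.user_id refers
# to users.id, and the staging schema already exists.
horizontal_slice_sql() {
    max_user_id=$1
    cat <<SQL
-- Slice relative to the full data set: a fixed subset of users.
SELECT * INTO staging.users
FROM users
WHERE id <= ${max_user_id};

-- Slice relative to the newly created slice: only rows owned by those users.
SELECT * INTO staging.stuff
FROM stuff s
WHERE EXISTS (SELECT 1 FROM staging.users u WHERE u.id = s.user_id);
SQL
}

horizontal_slice_sql 50
```

Keying the dependent tables off staging.users rather than repeating the user filter keeps the subset logic in one place, which matters because this kind of slice is the part that is difficult to write and maintain.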
  • 31. Case Study: Paperless Post
    Horizontal Slice
    Custom SQL
    Dynamic relative to full data set or newly created slice
    Mutations
    Email Addresses
    Use regular expressions to clean non-admin addresses
    e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
    Cached Data
    Clear cached short link from link-shortening API
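The email mutation can be sketched in a few lines. The shell helper below reproduces the example mapping from the slide, and the second helper prints a SQL statement that might apply the same rule inside the database with regexp_replace. The helper names and the email and admin column names are assumptions; the talk does not show the actual expression used.

```shell
#!/bin/sh
# Hypothetical helper: rewrite one address the way the slide example shows,
#   dude@gmail.com => staging+dudegmailcom@paperlesspost.com
mutate_email() {
    # Keep only letters and digits, then wrap the result in a plus-address.
    cleaned=$(printf '%s' "$1" | tr -cd '[:alnum:]')
    printf 'staging+%s@paperlesspost.com\n' "$cleaned"
}

# Hypothetical helper: print SQL applying the same rule to the sliced users.
# The email and admin column names are assumed, not taken from the talk.
mutation_sql() {
    cat <<'SQL'
UPDATE staging.users
SET email = 'staging+' || regexp_replace(email, '[^[:alnum:]]', '', 'g')
            || '@paperlesspost.com'
WHERE NOT admin;
SQL
}

mutate_email 'dude@gmail.com'
```

Because every mutated address lands on a single plus-addressed mailbox, mail sent from staging cannot reach a real customer, while each address stays unique and traceable to its original.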
  • 32. Case Study: Paperless Post
    Composite Slice including
    Vertical Slice – All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
    Vertical Slice – Entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
    Horizontal Slice – Subset of users and their data
    Mutation – Changed user email addresses
    pg_dump --data-only --schema staging db-01 >> slice.sql
  • 33. Case Study: Paperless Post
    Rebuild
    Prepare new database as standby
    Gracefully close connections
    Rotate by renaming databases
    Security
    Dedicated database build user
    Membership in application user role
    Application user role & privileges remain
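The rotation step can be sketched as SQL generated by a small helper. This is only one way to do it: terminating backends is a blunt stand-in for closing connections gracefully, and the database name staging and the helper name rotation_sql are assumptions (db-new appears on the rebuild slides). ALTER DATABASE ... RENAME cannot be run against the database you are connected to, and it fails while other sessions are connected, which is why connections are closed first.

```shell
#!/bin/sh
# Prints SQL that swaps a freshly built database into place by renaming.
# $1 = name the application connects to, $2 = freshly built database
rotation_sql() {
    live=$1
    fresh=$2
    cat <<SQL
-- Run while connected to a maintenance database such as postgres.
-- End remaining sessions on both databases so the renames can proceed.
-- (Before PostgreSQL 9.2 the pg_stat_activity column is procpid, not pid.)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname IN ('${live}', '${fresh}') AND pid <> pg_backend_pid();

ALTER DATABASE "${live}" RENAME TO "${live}-old";
ALTER DATABASE "${fresh}" RENAME TO "${live}";
SQL
}

rotation_sql staging db-new
```

The two renames are quick, so the old data stays available under the -old name if the new build turns out to be bad; a leftover -old database from the previous run would need to be dropped before the next rotation.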
  • 34. Case Study: Paperless Post
    Rebuild
    $ bzcat slice.sql.bz2 | psql db-new
    Staging schema has not been created, so all data loads to default schema
  • 35. Case Study: Paperless Post
    We hacked our rebuild by importing across schemas!
    Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
  • 36. Secret Weapon
    -- Updates all serial sequences for ID columns only
    CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
    DECLARE
      table_record record;
      table_name text;
    BEGIN
      FOR table_record IN SELECT pc.relname FROM pg_class pc WHERE pc.relkind = 'r' AND EXISTS (SELECT 1 FROM pg_attribute pa WHERE pa.attname = 'id' AND pa.attrelid = pc.oid) LOOP
        -- quote_ident keeps mixed-case or otherwise unusual table names intact
        table_name := quote_ident(table_record.relname::text);
        EXECUTE 'SELECT setval(pg_get_serial_sequence(' || quote_literal(table_name) || ', ' || quote_literal('id') || '), MAX(id)) FROM ' || table_name || '
        WHERE EXISTS (SELECT 1 FROM ' || table_name || ')';
      END LOOP;
    END;
    $$ LANGUAGE plpgsql;
  • 37. Case Study: Paperless Post
    Rebuild
    $ bzcat slice.sql.bz2 | psql db-new
    Staging schema has not been created, so all data loads to default schema
    echo "select 1 from update_id_sequences();" >> slice.sql
    Vacuum
    Reindex
  • 38. Case Study: Paperless Post
    Security
    Database build user
    CREATE DB privileges
    Member of Application user role
    Application user remains database owner
    Application user privileges remain limited
    Build only works in predetermined environments
  • 39. Case Study: Paperless Post
    Requirements
    Freshness – Daily, On command for non-developers
    Shrinkage – Slices, Mutations
    Resources
    Source – extra disk space, RAM, and CPUs
    Destination – limited, often entirely un-optimized
    Development -- constrained DBA resources
  • 40. Questions?
    Vanessa Hurst
    Paperless Post
    @DBNess
    Postgres Open, September 2011
  • 41. More Tools
    Copies -- LVM Snapshots
    See talk by Jon Erdman at PG Conf EU
    Great for all reads
    Data stays virtualized & doesn’t take up space until changed
    Ideal for DDL changes without actual data changes
  • 42. More Tools
    Copies, Slices -- pg_staging by dimitri http://github.com/dimitri/pg_staging
    Simple -- pauses pgbouncer & restores backup
    Efficient -- leverage bulk loading
    Flexible -- supports varying psql files
    Custom -- limited
    Slices -- replicate by rtomayko of GitHub http://github.com/rtomayko/replicate
    Simple -- Preserves object relations via ActiveRecord
    Inefficient -- Creates text-based .dump
    Inflexible -- Corrupts id sequences on data insert
    Custom -- highly