Honey I Shrunk the Database
  • I am Vanessa Hurst and I lead Data and Analytics at Paperless Post, a customizable online stationery startup in New York. I studied Computer Science and Systems and Information Engineering at the University of Virginia. I have experience in databases ranging from CMSes of a few hundred megabytes for non-profits to terabytes of financial data and high-traffic consumer websites. I've worked in data processing, product development, and business intelligence. I am a happy open-source convert and lone data wrangler in a land of web developers using Ruby on Rails.
  • Static Data
  • This may include external, legal regulations or internal regulations such as terms of service. Data protection can also include mitigating risk or proactively screening before data is even available.
    HIPAA Regulations
    PCI Compliance
    API Terms of Use
  • Any other reasons?
  • Requirements
      Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set
      Mutation -- protect user data in highly personal communications, prevent staging use of customer emails
      Daily Refresh
    Resources
      Source -- production server w/ample space, power, & memory
      Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure
      Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time
  • Quick vocabulary. Backup & restore, trigger-based replication: there are plenty of options, all straightforward, but they don’t give you much leeway on resources.
  • Most common case
  • If you’re doing Business Intelligence, you need a copy of your production database. Figure it out.
  • Vertical -- difficult to keep data valid & usable -- valid units of space are not always valid in an application
      e.g. WAL logs, pages 1-16 => smaller, finite size, not usable
    Horizontal -- requires application logic, highly customized but usable
      e.g. users with ids 1-50, users who joined before July 4, users who are admins, any SQL logic
  • http://www.postgresql.org/docs/current/static/app-pgdump.html
    Options to:
      Dump OIDs in case your app uses them
      Leave out ownership commands (if staging environments run as different users)
  • Static Data
  • Dedicated schema preserves all table, index, sequence names, etc
  • Only the build process is staging-specific, all other privileges and settings match production
  • http://github.com/rtomayko/replicate
  • Transcript

    • 1. Honey, I Shrunk the Database
      For Test and Development Environments
      Vanessa Hurst
      Paperless Post
      Postgres Open, September 2011
    • 2.
    • 3. User Data
    • 4. Why Shrink?
      Accuracy
        You don’t truly know how your app will behave in production unless you use real data.
        Production data is the ultimate in accuracy.
    • 5. Why Shrink?
      Accuracy
      Freshness
        New data should be available regularly.
        Full database refreshes should be timely.
    • 6. Why Shrink?
      Accuracy
      Freshness
      Resource Limitations
        Staging and developer machines cannot handle production load.
    • 7. Why Shrink?
      Accuracy
      Freshness
      Resource Limitations
      Data Protection
        Limit spread of sensitive user or client data.
    • 8. Why Shrink?
      Accuracy
      Freshness
      Resource Limitations
      Data Protection
    • 9. Case Study: Paperless Post
      Requirements
        Freshness – Daily, on command for non-developers
        Shrinkage – Slices, Mutations
    • 10. Case Study: Paperless Post
      Requirements
        Freshness – Daily, on command for non-developers
        Shrinkage – Slices, Mutations
      Resources
        Source – extra disk space, RAM, and CPUs
        Destination – limited, often entirely un-optimized
        Development – constrained DBA resources
    • 11. Shrink Strategies
      Copies
        Restored backups or live replicas of entire production database
    • 12. Shrink Strategies
      Copies
      Slices
        Select portions of exact data
    • 13. Shrink Strategies
      Copies
      Slices
      Mutations
        Sanitized, anonymized, or otherwise changed data
    • 14. Shrink Strategies
      Copies
      Slices
      Mutations
      Assumptions
        Seed databases, fixtures, test data
    • 15. Shrink Strategies
      Copies
      Slices
      Mutations
      Assumptions
    • 16. Slices
      Vertical Slice
        Difficult to obtain a valid, useful subset of data.
        Example: Include some entire tables, exclude others
    • 17. Slices
      Vertical Slice
        Difficult to obtain a valid, useful subset of data.
        Example: Include some entire tables, exclude others
      Horizontal Slice
        Difficult to write and maintain.
        Example: SQL or application code to determine subset of data
    • 18. PG Tools – Vertical Slice
      Flexibility at Source (Production)
      pg_dump
        Include data only [-a, --data-only]
        Include table schema only [-s, --schema-only]
        Select tables [-t table1 -t table2, --table=table1 --table=table2]
        Select schemas [-n schema, --schema=schema]
        Exclude schemas [-N schema, --exclude-schema=schema]
    • 19. PG Tools – Vertical Slice
      Flexibility at Destination (Staging, Development)
      pg_restore
        Include data only [-a, --data-only]
        Select indexes [-I index, --index=index]
        Tune processing [-j number-of-jobs, --jobs=number-of-jobs]
        Select schemas [-n schema, --schema=schema]
        Select triggers [-T trigger, --trigger=trigger]
        Exclude privileges [-x, --no-privileges, --no-acl]
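The switches on slides 18 and 19 can be combined into a dump-and-restore pair. The following is a minimal shell sketch, not the speaker's script: the function names `dump_tables` and `restore_slice`, the archive path, and the job count are hypothetical, and `--format=custom` is an assumption added because `pg_restore` reads archive-format dumps rather than plain SQL.

```shell
#!/bin/sh
# Hypothetical helpers; database, table, and file names are placeholders.

# dump_tables DB ARCHIVE TABLE...
# Dump the named tables (definitions and data) into a custom-format archive.
dump_tables() {
  db=$1
  archive=$2
  shift 2
  table_args=""
  for t in "$@"; do
    # pg_dump takes one name per switch, so --table is repeated per table.
    table_args="$table_args --table=$t"
  done
  # $table_args is left unquoted so each --table=... becomes its own argument
  # (safe only for simple table names without spaces).
  pg_dump --format=custom --no-owner $table_args --file="$archive" "$db"
}

# restore_slice ARCHIVE DB JOBS
# Restore without ownership or privilege commands, using JOBS parallel workers.
restore_slice() {
  pg_restore --no-owner --no-privileges --jobs="$3" --dbname="$2" "$1"
}

# Example (needs a running PostgreSQL server):
#   dump_tables db-01 static.dump cards
#   restore_slice static.dump staging_db 4
```

Leaving out ownership and privilege commands at both ends matches the speaker note about staging environments that run as different users.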
    • 20.
    • 21. Mutations
      External Data Protection
        HIPAA Regulations
        PCI Compliance
        API Terms of Use
    • 22. Mutations
      External Data Protection
        HIPAA Regulations
        PCI Compliance
        API Terms of Use
      Internal Data Protection
        Protecting your users’ personal data
        Protecting your users from accidents, e.g. staging emails
        Your Terms of Service
    • 23. User Data
    • 24. Case Study: Paperless Post
      Composite Slice including
        Vertical Slice – All application object schemas
        Vertical Slice – Entire tables of static content
        Horizontal Slice – Subset of users and their data
        Mutation – Changed user email addresses
    • 25. Case Study: Paperless Post
      Composite Slice including
        Vertical Slice – All application object schemas
          pg_dump --clean --schema-only --schema public db-01 > slice.sql
    • 26. Case Study: Paperless Post
      Composite Slice including
        Vertical Slice – All application object schemas
          pg_dump --clean --schema-only --schema public db-01 > slice.sql
        Vertical Slice – Entire tables of static content
          pg_dump --data-only --schema public -t cards db-01 >> slice.sql
    • 27. Case Study: Paperless Post
      Composite Slice including
        Vertical Slice – All application object schemas
          pg_dump --clean --schema-only --schema public db-01 > slice.sql
        Vertical Slice – Entire tables of static content
          pg_dump --data-only --schema public -t cards db-01 >> slice.sql
        Horizontal Slice – Subset of users and their data
        Mutation – Changed user email addresses
    • 28. Case Study: Paperless Post
      CREATE SCHEMA staging;
    • 29. Case Study: Paperless Post
      Horizontal Slice
        Custom SQL
          SELECT * INTO staging.users
          FROM users
          WHERE EXISTS (subset of users);
    • 30. Case Study: Paperless Post
      Horizontal Slice
        Custom SQL
          SELECT * INTO staging.users
          FROM users
          WHERE EXISTS (subset of users);
        Dynamic relative to full data set or newly created slice
          SELECT * INTO staging.stuff
          FROM stuff
          WHERE EXISTS (stuff per staging.users);
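The two `SELECT ... INTO` statements on slides 29 and 30 leave their subset conditions as placeholders. One way to fill them in is sketched below as a shell function that prints concrete SQL. The id range follows the "users with ids 1-50" example from the speaker notes; the function name `horizontal_slice_sql` and the `stuff.user_id` foreign key are hypothetical and are not taken from the talk.

```shell
#!/bin/sh
# horizontal_slice_sql MAX_ID
# Print SQL that copies users with ids 1..MAX_ID, plus their dependent rows,
# into the staging schema. "stuff.user_id" is a hypothetical foreign key.
horizontal_slice_sql() {
  max_id=$1
  # Accept digits only, since the value is spliced into SQL text.
  case $max_id in
    ''|*[!0-9]*) echo "MAX_ID must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
CREATE SCHEMA staging;

-- Slice relative to the full data set.
SELECT * INTO staging.users
FROM public.users
WHERE id BETWEEN 1 AND ${max_id};

-- Slice relative to the newly created slice.
SELECT s.* INTO staging.stuff
FROM public.stuff AS s
WHERE EXISTS (SELECT 1 FROM staging.users AS u WHERE u.id = s.user_id);
SQL
}

# Example (needs a running PostgreSQL server):
#   horizontal_slice_sql 50 | psql db-01
```

Filtering the second table against `staging.users` rather than repeating the user condition keeps every dependent slice consistent with whichever users were selected first.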
    • 31. Case Study: Paperless Post
      Horizontal Slice
        Custom SQL
        Dynamic relative to full data set or newly created slice
      Mutations
        Email Addresses
          Use regular expressions to clean non-admin addresses
          e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
        Cached Data
          Clear cached short link from link-shortening API
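The slide gives the before and after of the email mutation but not the expression itself. A minimal sketch that reproduces that example in shell follows; the function name `mutate_email` is hypothetical, and the rule used (drop every character that is not a letter or digit) is an assumption inferred from the single example. The non-admin check mentioned on the slide is not modelled here.

```shell
#!/bin/sh
# mutate_email ADDRESS
# Strip every character that is not a letter or digit, then embed the result
# in a staging+...@paperlesspost.com address, as in the slide's example.
mutate_email() {
  cleaned=$(printf '%s\n' "$1" | sed -E 's/[^[:alnum:]]//g')
  printf 'staging+%s@paperlesspost.com\n' "$cleaned"
}

mutate_email 'dude@gmail.com'
# prints staging+dudegmailcom@paperlesspost.com
```

Inside the database, the same pattern could be applied with PostgreSQL's `regexp_replace` in an `UPDATE` on the staging copy, so that real addresses never leave the source server.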
    • 32. Case Study: Paperless Post
      Composite Slice including
        Vertical Slice – All application object schemas
          pg_dump --clean --schema-only --schema public db-01 > slice.sql
        Vertical Slice – Entire tables of static content
          pg_dump --data-only --schema public -t cards db-01 >> slice.sql
        Horizontal Slice – Subset of users and their data
        Mutation – Changed user email addresses
          pg_dump --data-only --schema staging db-01 >> slice.sql
    • 33. Case Study: Paperless Post
      Rebuild
        Prepare new database as standby
        Gracefully close connections
        Rotate by renaming databases
      Security
        Dedicated database build user
        Membership in application user role
        Application user role & privileges remain
    • 34. Case Study: Paperless Post
      Rebuild
        $ bzcat slice.sql.bz2 | psql db-new
        Staging schema has not been created, so all data loads to default schema
    • 35. Case Study: Paperless Post
      We hacked our rebuild by importing across schemas!
      Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
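    • 36. Secret Weapon
      -- Updates all serial sequences for ID columns only.
      -- The CREATE FUNCTION wrapper and DECLARE block are not on the slide; they are
      -- assumed here, with the function name taken from the call on slide 37.
      CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
      DECLARE
        table_record record;
        table_name text;
      BEGIN
        -- Visit every ordinary table that has a column named "id".
        FOR table_record IN SELECT pc.relname FROM pg_class pc WHERE pc.relkind = 'r' AND EXISTS (SELECT 1 FROM pg_attribute pa WHERE pa.attname = 'id' AND pa.attrelid = pc.oid) LOOP
          table_name := table_record.relname::text;
          -- Set the sequence to the largest id; quote_ident guards unusual table names.
          EXECUTE 'SELECT setval(pg_get_serial_sequence(' || quote_literal(table_name) || ', ' || quote_literal('id') || '), MAX(id)) FROM ' || quote_ident(table_name) || ' WHERE EXISTS (SELECT 1 FROM ' || quote_ident(table_name) || ')';
        END LOOP;
      END;
      $$ LANGUAGE plpgsql;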
    • 37. Case Study: Paperless Post
      Rebuild
        $ bzcat slice.sql.bz2 | psql db-new
        Staging schema has not been created, so all data loads to default schema
        echo "select 1 from update_id_sequences();" >> slice.sql
        Vacuum
        Reindex
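The rebuild steps (load the slice, vacuum, reindex, then rotate by renaming) could be scripted roughly as follows. This is a sketch under stated assumptions rather than the Paperless Post build: the function names `load_slice` and `rotate_sql`, the `staging_db` and `db-old` database names, and the use of `ON_ERROR_STOP` are all hypothetical additions. Only `slice.sql.bz2` and `db-new` appear on the slides.

```shell
#!/bin/sh
# All function names and most database names below are placeholders.

# load_slice FILE.bz2 DB
# Load a compressed slice, then vacuum/analyze and reindex the new database.
load_slice() {
  bzcat "$1" | psql --set ON_ERROR_STOP=1 "$2" &&
    vacuumdb --analyze "$2" &&
    reindexdb "$2"
}

# rotate_sql LIVE FRESH OLD
# Print SQL that swaps FRESH into LIVE's name and keeps the previous copy as OLD.
rotate_sql() {
  cat <<SQL
DROP DATABASE IF EXISTS "$3";
ALTER DATABASE "$1" RENAME TO "$3";
ALTER DATABASE "$2" RENAME TO "$1";
SQL
}

# Example (needs a running PostgreSQL server):
#   load_slice slice.sql.bz2 db-new
#   rotate_sql staging_db db-new db-old | psql postgres
```

The rename statements are piped to a session on a different database (`postgres` here), because PostgreSQL will not rename the database a session is connected to, and the renames fail while other sessions are still attached. That is why connections are closed before the rotation.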
    • 38. Case Study: Paperless Post
      Security
        Database build user
          CREATE DB privileges
          Member of Application user role
        Application user remains database owner
        Application user privileges remain limited
        Build only works in predetermined environments
    • 39. Case Study: Paperless Post
      Requirements
        Freshness – Daily, on command for non-developers
        Shrinkage – Slices, Mutations
      Resources
        Source – extra disk space, RAM, and CPUs
        Destination – limited, often entirely un-optimized
        Development – constrained DBA resources
    • 40. Questions?
      Vanessa Hurst
      Paperless Post
      @DBNess
      Postgres Open, September 2011
    • 41. More Tools
      Copies -- LVM Snapshots
        See talk by Jon Erdman at PG Conf EU
        Great for all reads
        Data stays virtualized & doesn’t take up space until changed
        Ideal for DDL changes without actual data changes
    • 42. More Tools
      Copies, Slices -- pg_staging by dimitri
      http://github.com/dimitri/pg_staging
        Simple -- pauses pgbouncer & restores backup
        Efficient -- leverages bulk loading
        Flexible -- supports varying psql files
        Custom -- limited
      Slices -- replicate by rtomayko of GitHub
      http://github.com/rtomayko/replicate
        Simple -- Preserves object relations via ActiveRecord
        Inefficient -- Creates text-based .dump
        Inflexible -- Corrupts id sequences on data insert
        Custom -- highly
