Honey, I Shrunk the Database
For Test and Development Environments
Vanessa Hurst, Paperless Post
Postgres Open, September 2011
User Data
Why Shrink?
  • Accuracy: You don’t truly know how your app will behave in production unless you use real data. Production data is the ultimate in accuracy.
  • Freshness: New data should be available regularly. Full database refreshes should be timely.
  • Resource Limitations: Staging and developer machines cannot handle production load.
  • Data Protection: Limit the spread of sensitive user or client data.
Case Study: Paperless Post
Requirements
  • Freshness: daily, and on command for non-developers
  • Shrinkage: slices and mutations
Resources
  • Source: extra disk space, RAM, and CPUs
  • Destination: limited, often entirely unoptimized
  • Development: constrained DBA resources
Shrink Strategies
  • Copies: Restored backups or live replicas of the entire production database
  • Slices: Select portions of exact data
  • Mutations: Sanitized, anonymized, or otherwise changed data
  • Assumptions: Seed databases, fixtures, test data
Slices
  • Vertical Slice: Difficult to obtain a valid, useful subset of data. Example: include some entire tables, exclude others.
  • Horizontal Slice: Difficult to write and maintain. Example: SQL or application code to determine the subset of data.
PG Tools – Vertical Slice
Flexibility at Source (Production): pg_dump
  • Include data only [-a, --data-only]
  • Include table schema only [-s, --schema-only]
  • Select tables [-t table, --table=table], repeated once per table
  • Select schemas [-n schema, --schema=schema]
  • Exclude schemas [-N schema, --exclude-schema=schema]
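As a rough sketch of the table selection above: pg_dump accepts the table switch more than once, one table per switch. The helper below is hypothetical (the name `dump_tables` and the `DRY_RUN` variable are not from the talk). It builds a data-only dump command for a list of whole tables and, when `DRY_RUN` is set, prints the command instead of running it.

```shell
# Hypothetical helper: dump whole tables (data only) from one database.
# Usage: dump_tables DBNAME TABLE [TABLE...]
dump_tables() {
  db=$1
  shift
  n=$#
  # Rewrite the positional parameters from "T1 T2 ..." to
  # "--table=T1 --table=T2 ...": one switch per table.
  while [ "$n" -gt 0 ]; do
    set -- "$@" "--table=$1"
    shift
    n=$((n - 1))
  done
  if [ -n "${DRY_RUN:-}" ]; then
    # Dry run: show the command that would be executed.
    echo "pg_dump --data-only $* $db"
  else
    pg_dump --data-only "$@" "$db"
  fi
}
```

For example, `DRY_RUN=1 dump_tables db-01 cards` prints `pg_dump --data-only --table=cards db-01`.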
PG Tools – Vertical Slice
Flexibility at Destination (Staging, Development): pg_restore
  • Include data only [-a, --data-only]
  • Select indexes [-I index, --index=index]
  • Tune processing [-j number-of-jobs, --jobs=number-of-jobs]
  • Select schemas [-n schema, --schema=schema]
  • Select triggers [-T trigger, --trigger=trigger]
  • Exclude privileges [-x, --no-privileges, --no-acl]
Mutations
External Data Protection
  • HIPAA Regulations
  • PCI Compliance
  • API Terms of Use
Internal Data Protection
  • Protecting your users’ personal data
  • Protecting your users from accidents, e.g. staging emails
  • Your Terms of Service
User Data
Case Study: Paperless Post
Composite Slice including
  • Vertical Slice: All application object schemas
      pg_dump --clean --schema-only --schema public db-01 > slice.sql
  • Vertical Slice: Entire tables of static content
      pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  • Horizontal Slice: Subset of users and their data
  • Mutation: Changed user email addresses
Case Study: Paperless Post
Horizontal Slice
  CREATE SCHEMA staging;
  • Custom SQL
      SELECT * INTO staging.users
      FROM users
      WHERE EXISTS (subset of users);
  • Dynamic relative to full data set or newly created slice
      SELECT * INTO staging.stuff
      FROM stuff
      WHERE EXISTS (stuff per staging.users);
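One way the "stuff per staging.users" query might look, as a sketch only. It assumes the sliced users table has an `id` key and that each dependent table carries a foreign-key column pointing at it. The function name `slice_sql` and the example names `invitations` and `user_id` are hypothetical, not from the talk. The function only prints SQL, which can then be piped to psql.

```shell
# Hypothetical generator: emit SQL that copies the rows of a dependent
# table belonging to users already present in staging.users.
# Usage: slice_sql TABLE FK_COLUMN
slice_sql() {
  table=$1
  fk=$2
  cat <<SQL
SELECT t.* INTO staging.$table
FROM $table t
WHERE EXISTS (SELECT 1 FROM staging.users u WHERE u.id = t.$fk);
SQL
}
```

Because the statement filters against staging.users rather than the full users table, each dependent slice stays consistent with the user subset chosen first.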
Case Study: Paperless Post
Horizontal Slice
  • Custom SQL
  • Dynamic relative to full data set or newly created slice
Mutations
  • Email Addresses: Use regular expressions to clean non-admin addresses, e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
  • Cached Data: Clear cached short links from the link-shortening API
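The email mutation above can be sketched in a few lines of shell. This is an illustration, not the talk's actual implementation (which the slides do not show): the function name `mutate_email` is hypothetical, and treating any address already at paperlesspost.com as an "admin" address is an assumption made for the example.

```shell
# Hypothetical sketch of the email mutation.
# Assumption: addresses already at paperlesspost.com count as admin
# addresses and are left unchanged.
mutate_email() {
  case "$1" in
    *@paperlesspost.com)
      printf '%s\n' "$1"
      ;;
    *)
      # Strip every character that is not a letter or digit,
      # then wrap the remainder in a staging-only address.
      cleaned=$(printf '%s' "$1" | sed 's/[^A-Za-z0-9]//g')
      printf 'staging+%s@paperlesspost.com\n' "$cleaned"
      ;;
  esac
}
```

With this sketch, `mutate_email dude@gmail.com` prints `staging+dudegmailcom@paperlesspost.com`, so mail sent from staging lands in a company mailbox rather than a real user's inbox.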
Case Study: Paperless Post
Composite Slice including
  • Vertical Slice: All application object schemas
      pg_dump --clean --schema-only --schema public db-01 > slice.sql
  • Vertical Slice: Entire tables of static content
      pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  • Horizontal Slice: Subset of users and their data
  • Mutation: Changed user email addresses
      pg_dump --data-only --schema staging db-01 >> slice.sql
Case Study: Paperless Post
Rebuild
  • Prepare new database as standby
  • Gracefully close connections
  • Rotate by renaming databases
Security
  • Dedicated database build user
  • Membership in application user role
  • Application user role & privileges remain
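The "rotate by renaming databases" step above might be sketched as follows. The function name `rotate_sql` and the `_old` suffix are assumptions for illustration; the slides do not show the actual commands. PostgreSQL refuses to rename a database that has open sessions, and the current database cannot be renamed, which is why connections are closed first and why the SQL would be run while connected to a different database.

```shell
# Hypothetical sketch: emit the SQL that swaps a freshly built database
# into place. The previous live database is kept under an "_old" name.
# Usage: rotate_sql LIVE_DB NEW_DB
rotate_sql() {
  live=$1
  new=$2
  old="${live}_old"
  cat <<SQL
ALTER DATABASE "$live" RENAME TO "$old";
ALTER DATABASE "$new" RENAME TO "$live";
SQL
}
```

A possible invocation, assuming a maintenance database named postgres exists: `rotate_sql staging staging_new | psql -d postgres`.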
Case Study: Paperless Post
Rebuild
  $ bzcat slice.sql.bz2 | psql db-new
  • Staging schema has not been created, so all data loads to the default schema
Case Study: Paperless Post
We hacked our rebuild by importing across schemas! Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
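Secret Weapon
  -- Updates all serial sequences for ID columns only.
  -- The slide shows only the loop body; the CREATE FUNCTION wrapper and
  -- DECLARE block are reconstructed, with the function name taken from
  -- the rebuild slide that calls update_id_sequences().
  CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
  DECLARE
    table_record RECORD;
    table_name text;
  BEGIN
    -- Every ordinary table that has a column named "id"
    FOR table_record IN
      SELECT pc.relname
      FROM pg_class pc
      WHERE pc.relkind = 'r'
        AND EXISTS (SELECT 1 FROM pg_attribute pa
                    WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
    LOOP
      table_name := table_record.relname::text;
      -- Set the sequence behind "id" to the current maximum id
      EXECUTE 'SELECT setval(pg_get_serial_sequence('
        || quote_literal(table_name) || ', ' || quote_literal('id')
        || '), MAX(id)) FROM ' || quote_ident(table_name)
        || ' WHERE EXISTS (SELECT 1 FROM ' || quote_ident(table_name) || ')';
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;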
Case Study: Paperless Post
Rebuild
  $ bzcat slice.sql.bz2 | psql db-new
  • Staging schema has not been created, so all data loads to the default schema
  echo "select 1 from update_id_sequences();" >> slice.sql
  • Vacuum
  • Reindex
Case Study: Paperless Post
Security
  • Database build user
      CREATE DB privileges
      Member of Application user role
  • Application user remains database owner
  • Application user privileges remain limited
  • Build only works in predetermined environments
Questions?
Vanessa Hurst, Paperless Post
@DBNess
Postgres Open, September 2011
More Tools
Copies: LVM Snapshots
  • See talk by Jon Erdman at PG Conf EU
  • Great for all reads
  • Data stays virtualized & doesn’t take up space until changed
  • Ideal for DDL changes without actual data changes
More Tools
Copies, Slices: pg_staging by dimitri
http://github.com/dimitri/pg_staging
  • Simple: pauses pgbouncer & restores backup
  • Efficient: leverages bulk loading
  • Flexible: supports varying psql files
  • Custom: limited
Slices: replicate by rtomayko of GitHub
http://github.com/rtomayko/replicate
  • Simple: preserves object relations via ActiveRecord
  • Inefficient: creates text-based .dump
  • Inflexible: corrupts id sequences on data insert
  • Custom: highly

Honey I Shrunk the Database

  • 1. Honey, I Shrunk the Database For Test and Development Environments Vanessa Hurst Paperless Post Postgres Open, September 2011
  • 2.
  • 4. Why Shrink? Accuracy You don’t truly know how your app will behave in production unless you use real data. Production data is the ultimate in accuracy.
  • 5. Why Shrink? Accuracy Freshness New data should be available regularly. Full database refreshes should be timely.
  • 6. Why Shrink? Accuracy Freshness Resource Limitations Staging and developer machines cannot handle production load.
  • 7. Why Shrink? Accuracy Freshness Resource Limitations Data Protection Limit spread of sensitive user or client data.
  • 8. Why Shrink? Accuracy Freshness Resource Limitations Data Protection
  • 9. Case Study: Paperless Post Requirements Freshness – Daily, On command for non-developers Shrinkage – Slices, Mutations
  • 10. Case Study: Paperless Post Requirements Freshness – Daily, On command for non-developers Shrinkage – Slices, Mutations Resources Source – extra disk space, RAM, and CPUs Destination – limited, often entirely un-optimized Development -- constrained DBA resources
  • 11. Shrink Strategies Copies Restored backups or live replicas of entire production database
  • 12. Shrink Strategies Copies Slices Select portions of exact data
  • 13. Shrink Strategies Copies Slices Mutations Sanitized, anonymized, or otherwise changed data
  • 14. Shrink Strategies Copies Slices Mutations Assumptions Seed databases, fixtures, test data
  • 15. Shrink Strategies Copies Slices Mutations Assumptions
  • 16. Slices Vertical Slice Difficult to obtain a valid, useful subset of data. Example: Include some entire tables, exclude others
  • 17. Slices Vertical Slice Difficult to obtain a valid, useful subset of data. Example: Include some entire tables, exclude others Horizontal Slice Difficult to write and maintain. Example: SQL or application code to determine subset of data
  • 18. PG Tools – Vertical Slice Flexibility at Source (Production) pg_dump Include data only [-a --data-only] Include table schema only [-s --schema-only] Select tables [-t table1 -t table2 --table=table1 --table=table2] Select schemas [-n schema --schema=schema] Exclude schemas [-N schema --exclude-schema=schema]
  • 19. PG Tools – Vertical Slice Flexibility at Destination (Staging, Development) pg_restore Include data only [-a --data-only] Select indexes [-I index --index=index] Tune processing [-j number-of-jobs --jobs=number-of-jobs] Select schemas [-n schema --schema=schema] Select triggers [-T trigger --trigger=trigger] Exclude privileges [-x --no-privileges --no-acl]
  • 20.
  • 21. Mutations External Data Protection HIPAA Regulations PCI Compliance API Terms of Use
  • 22. Mutations External Data Protection HIPAA Regulations PCI Compliance API Terms of Use Internal Data Protection Protecting your users’ personal data Protecting your users from accidents, e.g. staging emails Your Terms of Service
  • 24. Case Study: Paperless Post Composite Slice including: Vertical Slice – All application object schemas; Vertical Slice – Entire tables of static content; Horizontal Slice – Subset of users and their data; Mutation – Changed user email addresses
  • 25. Case Study: Paperless Post Composite Slice including: Vertical Slice – All application object schemas: pg_dump --clean --schema-only --schema public db-01 > slice.sql
  • 26. Case Study: Paperless Post Composite Slice including: Vertical Slice – All application object schemas: pg_dump --clean --schema-only --schema public db-01 > slice.sql; Vertical Slice – Entire tables of static content: pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  • 27. Case Study: Paperless Post Composite Slice including: Vertical Slice – All application object schemas: pg_dump --clean --schema-only --schema public db-01 > slice.sql; Vertical Slice – Entire tables of static content: pg_dump --data-only --schema public -t cards db-01 >> slice.sql; Horizontal Slice – Subset of users and their data; Mutation – Changed user email addresses
  • 28. Case Study: Paperless Post CREATE SCHEMA staging;
  • 29. Case Study: Paperless Post Horizontal Slice Custom SQL: SELECT * INTO staging.users FROM users WHERE EXISTS (subset of users);
  • 30. Case Study: Paperless Post Horizontal Slice Custom SQL: SELECT * INTO staging.users FROM users WHERE EXISTS (subset of users); Dynamic relative to full data set or newly created slice: SELECT * INTO staging.stuff FROM stuff WHERE EXISTS (stuff per staging.users);
  • 31. Case Study: Paperless Post Horizontal Slice Custom SQL Dynamic relative to full data set or newly created slice Mutations: Email Addresses – Use regular expressions to clean non-admin addresses, e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com; Cached Data – Clear cached short link from link-shortening API
  • 32. Case Study: Paperless Post Composite Slice including: Vertical Slice – All application object schemas: pg_dump --clean --schema-only --schema public db-01 > slice.sql; Vertical Slice – Entire tables of static content: pg_dump --data-only --schema public -t cards db-01 >> slice.sql; Horizontal Slice – Subset of users and their data; Mutation – Changed user email addresses: pg_dump --data-only --schema staging db-01 >> slice.sql
  • 33. Case Study: Paperless Post Rebuild Prepare new database as standby Gracefully close connections Rotate by renaming databases Security Dedicated database build user Membership in application user role Application user role & privileges remain
  • 34. Case Study: Paperless Post Rebuild $ bzcat slice.sql.bz2 | psql db-new Staging schema has not been created, so all data loads to default schema
  • 35. Case Study: Paperless Post We hacked our rebuild by importing across schemas! Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
  • 36. Secret Weapon
        --Updates all serial sequences for ID columns only
        BEGIN
          FOR table_record IN
            SELECT pc.relname FROM pg_class pc
            WHERE pc.relkind = 'r'
              AND EXISTS (SELECT 1 FROM pg_attribute pa
                          WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
          LOOP
            table_name = table_record.relname::text;
            EXECUTE 'SELECT setval(pg_get_serial_sequence('
              || quote_literal(table_name) || ', ' || quote_literal('id')::text
              || '), MAX(id)) FROM ' || table_name
              || ' WHERE EXISTS (SELECT 1 FROM ' || table_name || ')';
          END LOOP;
  • 37. Case Study: Paperless Post Rebuild $ bzcat slice.sql.bz2 | psql db-new Staging schema has not been created, so all data loads to default schema echo "select 1 from update_id_sequences();" >> slice.sql Vacuum Reindex
  • 38. Case Study: Paperless Post Security Database build user CREATE DB privileges Member of Application user role Application user remains database owner Application user privileges remain limited Build only works in predetermined environments
  • 39. Case Study: Paperless Post Requirements Freshness – Daily, On command for non-developers Shrinkage – Slices, Mutations Resources Source – extra disk space, RAM, and CPUs Destination – limited, often entirely un-optimized Development -- constrained DBA resources
  • 40. Questions? Vanessa Hurst Paperless Post @DBNess Postgres Open, September 2011
  • 41. More Tools Copies -- LVM Snapshots: See talk by Jon Erdman at PG Conf EU; Great for all reads; Data stays virtualized & doesn’t take up space until changed; Ideal for DDL changes without actual data changes
  • 42. More Tools Copies, Slices -- pg_staging by dimitri http://github.com/dimitri/pg_staging Simple -- pauses pgbouncer & restores backup; Efficient -- leverage bulk loading; Flexible -- supports varying psql files; Custom -- limited. Slices -- replicate by rtomayko of GitHub http://github.com/rtomayko/replicate Simple -- Preserves object relations via ActiveRecord; Inefficient -- Creates text-based .dump; Inflexible -- Corrupts id sequences on data insert; Custom -- highly
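
The email mutation on slide 31 can be sketched in shell. This is a minimal sketch, not the build code from the talk: the slides give only the example mapping dude@gmail.com => staging+dudegmailcom@paperlesspost.com, so the function name mutate_email, the STAGING_DOMAIN variable, and the rule that any address already on the company domain counts as an admin address are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not from the talk: rewrite one email address so that
# mail sent from staging cannot reach a real user.
# Assumption: "admin" addresses are those already on the company domain.
# The slides do not say how admin addresses were actually identified.
STAGING_DOMAIN="paperlesspost.com"

mutate_email() {
  case "$1" in
    *@"$STAGING_DOMAIN")
      # Treated as an admin address: leave it untouched.
      printf '%s\n' "$1"
      ;;
    *)
      # Strip every non-alphanumeric character with a regular expression,
      # then wrap the remainder in a plus-addressed mailbox on the
      # company domain.
      cleaned=$(printf '%s' "$1" | sed 's/[^[:alnum:]]//g')
      printf 'staging+%s@%s\n' "$cleaned" "$STAGING_DOMAIN"
      ;;
  esac
}

mutate_email "dude@gmail.com"
```

The final call reproduces the mapping shown on the slide. In a real build the same substitution would more likely run inside the database against the staging.users copy before it is dumped; the shell version only illustrates the transformation itself.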

Editor's Notes

  1. I am Vanessa Hurst and I lead Data and Analytics at Paperless Post, a customizable online stationery startup in New York. I studied Computer Science and Systems and Information Engineering at the University of Virginia. I have experience in databases ranging from few-hundred-megabyte CMSes for non-profits to terabytes of financial data and high-traffic consumer websites. I've worked in data processing, product development, and business intelligence. I am a happy open-source convert and lone data wrangler in a land of web developers using Ruby on Rails.
  2. Static Data
  3. This may include external, legal regulations or internal regulations such as terms of service. Data protection can also include mitigating risk or proactively screening before data is even available. HIPAA Regulations, PCI Compliance, API Terms of Use.
  4. Any other reasons?
  5. Requirements: Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set. Mutation -- protect user data in highly personal communications, prevent staging use of customer emails. Daily Refresh. Resources: Source -- production server w/ample space, power, & memory. Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure. Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time.
  6. Requirements: Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set. Mutation -- protect user data in highly personal communications, prevent staging use of customer emails. Daily Refresh. Resources: Source -- production server w/ample space, power, & memory. Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure. Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time.
  7. Quick vocabulary. Backup & restore, trigger-based replication: there are plenty of options that are all straightforward, but they don’t give you a lot of leeway on resources.
  8. Most common case
  9. If you’re doing Business Intelligence, you need a copy of your production database. Figure it out.
  10. Vertical -- difficult to keep data valid & usable -- valid units of space are not always valid in an application, e.g. WAL logs, Pages 1-16 => smaller, finite size, not usable. Horizontal -- requires application logic, highly customized but usable, e.g. Users with ids 1-50, Users who joined before July 4, Users who are admins, any SQL logic.
  11. Vertical -- difficult to keep data valid & usable -- valid units of space are not always valid in an application, e.g. WAL logs, Pages 1-16 => smaller, finite size, not usable. Horizontal -- requires application logic, highly customized but usable, e.g. Users with ids 1-50, Users who joined before July 4, Users who are admins, any SQL logic.
  12. http://www.postgresql.org/docs/current/static/app-pgdump.html Options to: Dump OIDs in case your app uses them; Leave out ownership commands (if staging environments run as different users)
  13. http://www.postgresql.org/docs/current/static/app-pgdump.html Options to: Dump OIDs in case your app uses them; Leave out ownership commands (if staging environments run as different users)
  14. Static Data
  15. Dedicated schema preserves all table, index, sequence names, etc
  16. Only the build process is staging-specific, all other privileges and settings match production
  17. Only the build process is staging-specific, all other privileges and settings match production
  18. Only the build process is staging-specific, all other privileges and settings match production
  19. Only the build process is staging-specific, all other privileges and settings match production
  20. Requirements: Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set. Mutation -- protect user data in highly personal communications, prevent staging use of customer emails. Daily Refresh. Resources: Source -- production server w/ample space, power, & memory. Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure. Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time.
  21. http://github.com/rtomayko/replicate