Leverage AWS capabilities to automate data sanitization of MySQL RDS instances, with easy storage and retrieval using S3 for short-term storage and Glacier for long-term storage.
1. TOOLS OF THE TRADE:
Automated Database Sanitization with AWS
Dee Wilcox
Nashville PHP Monthly Meetup
April 11, 2017
2. About Me
● Senior Software Developer at NASBA.
● Not a Nashville native, but it’s been home for 7 years.
● Married for 14 years, 2 daughters ages 6 and 1.
● Teaching myself to code since 2002.
● Passionate about the maker movement, mentoring and
empowering women in tech, and building healthy teams.
3. Full Stack Dev Tools of the Trade
Webserver
● Linux
● Apache
● Easy to create development environments with AWS EC2s or Lightsail, Docker, and a host of other providers.
Database
● MySQL or Postgres
○ Integrated into the LAMP webserver
○ Dedicated database server
○ AWS RDS instance
Application Layer
● Application code runs on the webserver and connects out to the database
● Environments are managed through application configuration files
4. The Problem:
How do we cleanly reload production data into a development or testing environment without compromising security?
6. Option 1:
Web-Based On-Demand Solution
One option is to create a simple web application designed to retrieve, sanitize, and store sanitized MySQL dump files so that they are easily accessible.
7. Benefits of an On-Demand Solution
Easy Maintenance
● PHP application code is easy to maintain
● The development team can modify and improve it
Easy on the Database
● Dump files are only created as needed
● Better for storage management
Easy Tracking
● Easy tracking of recent requests
● Helps eliminate duplicate requests
Easy Storage and Retrieval
● A common storage and retrieval mechanism streamlines processes for Ops & Development
8. In Practice
Failing Un-Gracefully
● Failures to find or execute the sanitization routines were not captured or returned to the user, causing data dumps either not to be created, or to be created while still containing sensitive data.
Room for improvement:
● Capture all types of MySQL errors
● Add logging and notification controls for successful and unsuccessful processing
Too Tightly Coupled
● Tightly coupling the data layer's sanitization code with the application code made it difficult to maintain a separation of concerns.
● In an environment with separate development, operations, and database administration roles, this made the process more cumbersome.
Room for improvement:
● Separate the application and data layers.
9. Option 2:
Automated Solution
Another option is to create a simple shell script (or two) that runs nightly on a utility server with read-only access to the production database (or a production replica).
10. Benefits of an Automated Solution
Easier Maintenance
● Bash code is easy to read and maintain for both Operations and Development
Easier on the Database
● Dump files are created when the databases are least used
Easier Tracking
● No request management: dump files are delivered for all databases nightly
Easy Storage and Retrieval
● A common storage and retrieval mechanism streamlines processes for Ops & Development
11. In Practice
Boring Bash
● Bash shell scripting has its place, but it's not necessarily a beloved programming language in the PHP development community.
● Luckily, it's still easy to work with.
Room for improvement:
● The bash script is long and procedural. It could be organized into functions, which would be easier for most PHP developers to follow.
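As a hedged sketch of that reorganization (function names and the parameter-expansion style are illustrative, not from the actual script), the procedural steps could become named bash functions:

```shell
#!/usr/bin/env bash
# Illustrative structure only: each stage of the pipeline gets its own function.

derive_database_name() {
  # sanitize_appdb.sql -> appdb (replaces the sed pipeline with parameter expansion)
  local routine=$1
  routine=${routine%.sql}
  routine=${routine#sanitize_}
  echo "${routine%_noop}"
}

recreate_sanitizer_db() {
  local database=$1
  mysql --login-path=local -e "drop database if exists sanitize_$database;"
  mysql --login-path=local -e "create database sanitize_$database default character set utf8;"
}

load_and_sanitize() {
  local database=$1
  mysqldump --login-path=local --lock-tables=false "$database" |
    mysql --login-path=local "sanitize_$database"
  echo "call sanitize_$database.sanitize_$database(1);" | mysql --login-path=local
}

# Entry point (commented out so the sketch can be sourced without a database):
# for routine in sanitize*.sql; do
#   db=$(derive_database_name "$routine")
#   recreate_sanitizer_db "$db"
#   load_and_sanitize "$db"
# done
```

Each function can then be tested and reasoned about on its own, which is closer to how most PHP developers organize code.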
Tightly Defined Sanitization Parameters
● The current sanitization scripts perform standardized sanitization on table columns known to contain sensitive data.
● They do not scan the data tables using regular expressions to identify or mask additional PII.
Room for improvement:
● Smarter sanitization logic.
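One way to make the logic smarter, sketched here under assumptions (the `local` login path from the slides, an example database name, and a simple email regex; none of this is from the actual scripts): query information_schema for text columns, then probe each one for email-like values.

```shell
#!/usr/bin/env bash
# Illustrative PII scan: list text columns, then flag any with email-like values.
scan_database="sanitize_appdb"   # example name

columns=$(mysql --login-path=local -N -e \
  "select concat(table_name, '.', column_name)
     from information_schema.columns
    where table_schema = '$scan_database'
      and data_type in ('varchar', 'char', 'text');" 2>/dev/null)

for tbl_col in $columns; do
  table=${tbl_col%%.*}
  column=${tbl_col#*.}
  hits=$(mysql --login-path=local -N -e \
    "select count(*) from \`$scan_database\`.\`$table\`
      where \`$column\` regexp '[^@ ]+@[^@ ]+\\\\.[a-z]{2,}';" 2>/dev/null)
  if [ "${hits:-0}" -gt 0 ]; then
    echo "Possible unmasked PII: $table.$column"
  fi
done
```

The same loop could be extended with patterns for phone numbers or SSNs; anything it flags that the standard routines don't already cover is a gap in the sanitization parameters.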
12. Leveraging AWS
[Architecture diagram]
● Webserver: Lightsail or EC2
● Database: MySQL RDS, with a MySQL replica
● Storage & Retrieval: S3 and Glacier
● Utility scripts: automated sanitization and storage; automated retrieval
13. Step 1:
Set up the Environment
Key dependencies:
● Linux webserver
● AWS credentials for S3
● AWS CLI
● Database credentials
○ Username and password, OR
○ Login path
● Repository for sanitizer_db
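As a hedged sketch (the choice of bash and the exact command list are assumptions about the environment), a preflight check at the top of the nightly script can confirm these dependencies before anything runs:

```shell
#!/usr/bin/env bash
# Preflight check: confirm the external tools the sanitizer relies on
# are installed before the nightly run starts.
missing=""
for cmd in mysql mysqldump aws bzip2; do
  if ! command -v "$cmd" >/dev/null 2>&1; then
    missing="$missing$cmd "
  fi
done

if [ -n "$missing" ]; then
  echo "Missing dependencies: $missing"
else
  echo "All dependencies present"
fi
```

Failing loudly here is cheaper than discovering halfway through a run that the AWS CLI or bzip2 was never installed on the utility server.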
14. Setting up Sanitizer DB
Create Statements
● Define the databases that need to be sanitized.
● Include specific and accurate create statements that match the production configuration for these databases.
Grants and Definers
● Make sure your new database user has read-only access to the other databases and write access to create and drop new databases.
Sanitization Routines
● Clearly define the data to be sanitized.
● Use queries or stored routines - whatever fits your environment best.
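As a hedged sketch (the appdb database, users table, column names, and sanitizer account are all illustrative, not from the actual sanitizer_db repository), the grants and one stored routine might look like this:

```sql
-- Grants (illustrative): read-only on source databases, full control
-- over the sanitize_* working copies.
GRANT SELECT ON appdb.* TO 'sanitizer'@'localhost';
GRANT ALL PRIVILEGES ON `sanitize\_%`.* TO 'sanitizer'@'localhost';

-- Routine (illustrative): overwrite known-sensitive columns in place.
DELIMITER //
CREATE PROCEDURE sanitize_appdb.sanitize_appdb(IN verbose INT)
BEGIN
  UPDATE sanitize_appdb.users
     SET email = CONCAT('user', id, '@example.com'),
         phone = '555-0100',
         ssn   = NULL;
  -- Returning a row lets the calling script confirm success.
  IF verbose = 1 THEN
    SELECT 'appdb sanitized' AS status;
  END IF;
END //
DELIMITER ;
```

Using deterministic placeholders (rather than random values) keeps sanitized dumps stable between runs, which makes diffs and test fixtures easier to work with.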
15. Step 2:
Set up the Database
# Create databases and definers
mysql --login-path=local < /data/sites/sanitizer_db/databases/create.sql
mysql --login-path=local < /data/sites/sanitizer_db/databases/definers.sql
# Loop through sanitization routines
cd /data/sites/sanitizer_db/routines
for routine in sanitize*.sql
do
routine_name=$(echo $routine | sed 's/.sql//')
16. Step 3:
Compile the Sanitization Routines and Empty Databases
database=$(echo $routine | sed 's/.sql//g ; s/^sanitize_// ; s/_noop//')
database_filename=$database$filename
# Drop the database in sanitizer if it already exists
mysql --login-path=local -e "drop database if exists sanitize_$database;"
# Create a database
mysql --login-path=local -e "create database sanitize_$database default character set utf8;"
# Compile the stored routine for sanitization
mysql --login-path=local < $routine
17. Step 4:
Load and Sanitize
# Generate dump files of each database
mysqldump --login-path=local --lock-tables=false $database | mysql --login-path=local sanitize_$database
# Run sanitization and capture output
sanitized=$(echo "call sanitize_$database.sanitize_$database(1);" | mysql --login-path=local)
18. Step 5:
Catch Errors
if [ "$?" != 0 ]
then
echo "There was a problem executing the stored routine."
fi
if [ -z "$sanitized" ]
then
sanitized_fail+="$database "
fi
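These checks can be tightened further. A hedged sketch (variable names follow the slides; the mktemp approach and error messages are assumptions, not the actual script) that captures the mysql call's exit status and stderr together:

```shell
#!/usr/bin/env bash
# Capture both the exit status and any error text from the sanitization call.
errlog=$(mktemp)
sanitized=$(echo "call sanitize_$database.sanitize_$database(1);" \
  | mysql --login-path=local 2>"$errlog")
status=$?

if [ "$status" -ne 0 ] || [ -s "$errlog" ] || [ -z "$sanitized" ]; then
  echo "There was a problem sanitizing $database:" >&2
  cat "$errlog" >&2
  sanitized_fail+="$database "
else
  sanitized_success+="$database "
fi
rm -f "$errlog"
```

This addresses the earlier "room for improvement" note: all MySQL errors are surfaced, and every database lands in exactly one of the success or failure lists for logging and notification.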
19. Step 6:
Dump and Compress Sanitized Data
if [ "$sanitized" ]
then
# Add entry to the sanitized success array
sanitized_success+="$database "
# Remove existing sanitized file
rm -f "$local_directory"/"$database_filename"
# Create compressed mysqldump file
mysqldump --login-path=local --lock-tables=false --no-create-info --skip-triggers $database | bzip2 > "$local_directory"/"$database_filename"
# Send to S3
/usr/local/bin/aws --profile $s3_profile s3 mv "$local_directory"/"$database_filename" "$s3_url"/"$database_filename" --region $s3_region
fi
20. Step 7:
Clean Up the Environment
# Drop the sanitized database
mysql --login-path=local -e "drop database sanitize_$database;"
# Remove the SQL file if it still exists
rm -f "$database"_sanitized.sql
done
21. Storage and Retrieval in S3
Using the AWS CLI
● Credentials must be defined and exist in ~/.aws/config
● Include parameters for region and profile
● The encryption flag is only needed on retrieval of the file
Using a Scheduler
● Use simple crontab functionality to create a scheduled job.
● Use AWS Lambda to schedule events.
Audit Controls
● Use the $sanitized_fail and $sanitized_success arrays to track successes and failures.
● Make use of logging and notifications to meet audit requirements and immediately notify users of any issues.
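As a hedged sketch (the schedule, script path, bucket, and profile names are all illustrative), the scheduling and retrieval pieces might look like:

```shell
#!/usr/bin/env bash
# Illustrative crontab entry: run the sanitizer at 2:30 AM, when the
# databases are least used, and keep a log for audit purposes.
cron_entry='30 2 * * * /data/sites/sanitizer_db/sanitize_all.sh >> /var/log/sanitizer.log 2>&1'
echo "$cron_entry"

# Retrieval (illustrative): pull a sanitized dump back down using the
# same profile and region that the upload used, adding whatever
# encryption flag your bucket configuration requires.
# aws --profile sanitizer s3 cp "s3://example-bucket/appdb_sanitized.sql.bz2" . --region us-east-1
```

Logging the cron job's output to a file, combined with the success and failure lists above, gives Ops a simple audit trail for every nightly run.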
23. Where to Find Me
Twitter: https://twitter.com/dee_wilcox
LinkedIn: https://www.linkedin.com/in/deewilcox
Google+: https://plus.google.com/+DeeWilcoxOnline
GitHub: https://github.com/deewilcox