Fixing Web Data
in Production
Best practices for bad situations
Aaron Knight, Full Stack Engineer at Voxy
Aaron Knight (@iamaaronknight)
Full Stack Engineer
Voxy.com
I am
A Django web app
8 years old
12 engineers
10+ data stores
Voxy is
Oops.
There’s a problem with the data!
(It was probably my fault)
> SELECT * FROM feature_toggles LIMIT 3;
+-------+----------+--------------------+
| id    | user_id  | orientation_videos |
|-------+----------+--------------------+
| 1234  | 8923123  | f                  |
| 1235  | 9213483  | f                  |
| 1236  | 2136935  | f                  |
> UPDATE feature_toggles SET
  orientation_videos = 't' WHERE...
Hold up!
What could go wrong?
● You bring down the site.
● You make the problem worse.
● You forget what you did.
Never change
data in prod
● Never introduce any bugs.
● Make all the right architecture decisions
the first time.
</sarcasm>
Data fixes are code.
● Check them in to source control.
Data fixes are code.
● Check them in to source control.
● Test them.
Data fixes are code.
● Check them in to source control.
● Test them.
● Code review them.
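One way to make these scripts easy to test is to keep the decision logic in a pure function, separate from the ORM calls. A minimal sketch (the function and `entitled_user_ids` are hypothetical names, not from the talk):

```python
def corrected_orientation_videos(user_id, entitled_user_ids):
    """Pure decision logic: the toggle should be on exactly for entitled users.

    Because this takes plain values and returns a plain bool, it can be
    unit-tested without touching the database.
    """
    return user_id in entitled_user_ids

# The fix script then becomes a thin, hard-to-get-wrong loop:
#     for toggle in FeatureToggle.objects.all():
#         toggle.orientation_videos = corrected_orientation_videos(
#             toggle.user_id, entitled_user_ids)
#         toggle.save()
```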
Track execution.
● Log when a script is executed.
Track execution.
● Log when a script is executed.
● Log everything that changed.
Track execution.
● Log when a script is executed.
● Log everything that changed.
● Log what did not change.
def fix_feature_toggles():
    logger.info('Starting fix_feature_toggles script')
    for toggle in FeatureToggle.objects.all():
        if toggle.orientation_videos:
            logger.info('FeatureToggle {} orientation_videos '
                        'already set; skipping'.format(toggle.id))
        else:
            toggle.orientation_videos = get_correct_value(toggle)
            toggle.save()
            logger.info('FeatureToggle {} orientation_videos updated to '
                        '{}'.format(toggle.id, toggle.orientation_videos))
    logger.info('Finished fix_feature_toggles script')
Track execution.
● Log when a script is executed.
● Log everything that changed.
● Log what did not change.
● Centralize your logging.
import datetime
import json
from collections import OrderedDict

import boto3
import pytz

firehose = boto3.client('firehose')

def log_to_kinesis(message):
    data = OrderedDict([
        ('script_name', get_filename_of_caller()),
        ('environment', settings.ENVIRONMENT),
        ('ts', str(pytz.utc.localize(datetime.datetime.now()))),
        ('message', message),
    ])
    firehose.put_record(
        DeliveryStreamName='backfill-logs',
        Record={'Data': json.dumps(data, sort_keys=False) + '\n'}
    )
Track execution.
● Log when a script is executed.
● Log everything that changed.
● Log what did not change.
● Centralize your logging.
● Track the script’s progress.
from tqdm import tqdm

def backfill_toggles():
    count = FeatureToggle.objects.count()
    for toggle in tqdm(FeatureToggle.objects.all(), total=count):
        ...
Be fault-tolerant.
● Think of possible exceptions.
for user_id in list_of_user_ids:
    try:
        toggle = FeatureToggle.objects.get(user_id=user_id)
    except FeatureToggle.DoesNotExist:
        logger.info('FeatureToggle does not exist for User '
                    '{}'.format(user_id))
        continue
    toggle.orientation_videos = True
    toggle.save()
Be fault-tolerant.
● Think of possible exceptions.
● Make your scripts idempotent, if possible.
for user_id in list_of_user_ids:
    try:
        toggle = FeatureToggle.objects.get(user_id=user_id,
                                           backfilled=False)
    except FeatureToggle.DoesNotExist:
        continue
    toggle.orientation_videos = True
    toggle.backfilled = True
    toggle.save()
Be fault-tolerant.
● Think of possible exceptions.
● Make your scripts idempotent, if possible.
● Make your changes reversible, if possible.
{"environment": "production", "ts": "2017-10-02 18:33:08.805645+00:00", "message": "unit_id 117 resource_id: None > 597ca7531ce6856f34607de9"}
{"environment": "production", "ts": "2017-10-02 18:33:08.878832+00:00", "message": "unit_id 28 resource_id: None > 54ca763ca8615a76184dd4a9"}
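Because each record captures both the old and the new value ("None > 597c..."), the logs themselves can drive a rollback. A sketch of parsing them back out (the regex matches the message format shown above; `parse_reversal` is a hypothetical helper):

```python
import json
import re

# Matches messages like "unit_id 117 resource_id: None > 597ca7531ce6856f34607de9"
MESSAGE = re.compile(r'unit_id (\d+) resource_id: (\S+) > (\S+)')

def parse_reversal(log_line):
    """Return (unit_id, old_value, new_value) from one backfill log record,
    or None if the record doesn't describe a change."""
    record = json.loads(log_line)
    match = MESSAGE.search(record['message'])
    if not match:
        return None
    unit_id, old, new = match.groups()
    return int(unit_id), (None if old == 'None' else old), new
```

Reversing the damage is then a matter of writing each `old_value` back for its `unit_id`.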
Know your bottlenecks.
● CPU?
Know your bottlenecks.
● CPU?
● Memory?
Know your bottlenecks.
● CPU?
● Memory?
● Database?
feature_toggles = []
for user_id in user_ids_to_backfill:
    feature_toggles.append(FeatureToggle(
        user_id=user_id,
        orientation_videos=True,
    ))
FeatureToggle.objects.bulk_create(feature_toggles)
def backfill_activity_progresses():
    conn = psycopg2.connect("some_credentials")
    cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    data_to_replicate = []
    for index in tqdm(batch_range):
        # Page through the table one batch per iteration
        # (BATCH_SIZE assumed defined alongside batch_range).
        cursor.execute(
            "SELECT user_id, type, correct_answers, total_answers, is_complete "
            "FROM legacy_activity ORDER BY id LIMIT %s OFFSET %s;",
            (BATCH_SIZE, index * BATCH_SIZE))
        data_to_replicate.extend(cursor.fetchall())
    conn.close()
    add_to_new_activity_table(data_to_replicate)
Know your bottlenecks.
● CPU?
● Memory?
● Database?
● Developer time?
● Cognitive overhead?
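For the memory bottleneck, one option is to process rows in fixed-size chunks instead of materializing everything at once. A generic sketch that works for any iterable, including a queryset's `.iterator()` (the helper name is mine):

```python
from itertools import islice

def in_chunks(iterable, size):
    """Yield lists of at most `size` items, so only one chunk is in memory."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

With this, a backfill can save or bulk-create one chunk at a time rather than loading the whole table.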
Execute at the right level
of abstraction.
● Use existing functions and the ORM when
you can afford to.
● Use SQL when execution time becomes
significant.
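Dropping to SQL means one UPDATE instead of N object loads and saves. A self-contained illustration using sqlite3 from the standard library (a production version would go through your real connection, e.g. Django's `connection.cursor()`):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE feature_toggles (id INTEGER, user_id INTEGER, '
             'orientation_videos TEXT)')
conn.executemany('INSERT INTO feature_toggles VALUES (?, ?, ?)',
                 [(1234, 8923123, 'f'), (1235, 9213483, 'f')])

# One statement fixes every affected row; no per-object overhead.
conn.execute("UPDATE feature_toggles SET orientation_videos = 't' "
             "WHERE orientation_videos = 'f'")

rows = conn.execute('SELECT orientation_videos FROM feature_toggles').fetchall()
```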
Use database
snapshots.
● Test your script on a backup from
production.
Use database
snapshots.
● Test your script on a backup from
production.
● Take snapshots before you make changes.
Use database
snapshots.
● Test your script on a backup from
production.
● Take snapshots before you make changes.
● Automate your backups.
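Automating the pre-run snapshot can be as small as a wrapper your scripts call first. A sketch against the AWS RDS API (assumes boto3 credentials are configured; the helper names are mine):

```python
import datetime

def snapshot_identifier(db_instance, now=None):
    """Build a unique, sortable snapshot name like 'mydb-pre-backfill-20171002-183308'."""
    now = now or datetime.datetime.utcnow()
    return '{}-pre-backfill-{}'.format(db_instance, now.strftime('%Y%m%d-%H%M%S'))

def snapshot_before_backfill(db_instance):
    # boto3 imported here so the naming helper stays testable without AWS.
    import boto3
    rds = boto3.client('rds')
    rds.create_db_snapshot(
        DBInstanceIdentifier=db_instance,
        DBSnapshotIdentifier=snapshot_identifier(db_instance),
    )
```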
Aaron Knight (@iamaaronknight)
Full Stack Engineer
Voxy.com
Thanks!

Editor's Notes

  • #5 This is a talk about how to deal with problems that arise with your data in production.
  • #6 Imagine that you’re working on a web application. It has an admin interface where admins can toggle different features on and off for different users.
  • #7 Somehow, those feature toggles got messed up. Now the orientation_videos flag is set to False for a large number of users.
  • #8 Fortunately, you have a way to recover which users are supposed to have that feature enabled so you can go into your database and fix the problem.
  • #9 If that is your response to updating production data in your web application, I’m going to suggest that you stop and think about the variety of ways in which such an operation can go wrong.
  • #10 The right approach is to never, ever change production data.
  • #11 But of course that’s not how things work in the real world.
  • #12 So instead, let’s talk about some realistic ways to fix your data safely. The first bit of advice should be obvious: don’t just start executing SQL queries or shell commands off the cuff. Treat these changes as code. That means checking them in...
  • #13 Testing them. This might seem like a waste of time since you’re probably going to throw this code away after you run it. But better to waste a little time writing a test than a lot of time trying to reverse a catastrophic mistake to your data.
  • #14 Whatever process your team has for code review, do that.
  • #15 Secondly, when you execute one of these scripts, the last thing that you want is code that runs silently for an indeterminate period of time, and may or may not have had the desired effect. So be very generous with your logging.
  • #18 Here’s an example script. Note that we’re logging at the beginning of the function, at the end, and for every code path in between.
  • #19 Another consideration is that you should centralize your logging. These logs need to be accessible to anyone on your team who may need to look back and see what happened.
  • #20 Any tool that works for you is fine, but what we use is Amazon Kinesis Firehose. We use it to write logs from our scripts to Amazon S3. What’s great about this service is the simplicity of using it.
  • #21 You set up a firehose in AWS, and then writing a log to an S3 bucket is just as simple as this. Note that I have a cute little function that gets the filename of the script that’s calling this function.
  • #22 Since I imagine that you’ll still be running these scripts manually, it’s pretty important to know how far along you are.
  • #23 For that, we use tqdm
  • #24 You pass an iterable into the tqdm function, giving it a total if your iterable is expensive to get the length of, and it gives you a little progress bar which shows time estimates.
  • #25 Speaking of time, your script might take a long time to run. And won’t it be annoying if it runs for 3 hours and then fails halfway through?
  • #26 It’s always easy to forget to handle error conditions.
  • #27 So it’s a good idea to practice defensive programming in this case. Think about possible exceptions and catch them, making sure to log.
  • #28 Another thing to think about is, when your script does break 2 hours in, can you safely run it again and get the desired results?
  • #29 This might not always be possible or necessary, but in some extreme cases we have resorted to adding a new field to a model to track which items have been backfilled.
  • #30 Another nice feature is reversibility. If you screwed something up, is it possible to figure out what the previous state of the data was?
  • #31 This is where really detailed logging comes into play. If your logs contain all of the necessary information, you could conceivably parse them to get the original state of your data and reverse the damage.
  • #32 If you’re dealing with a lot of data that needs to be fixed, you’re going to start needing to do some actual engineering. You’re going to need to think about what bottlenecks you might encounter. Maybe you’re doing something computationally intensive, in which case you might need to think about how to parallelize the job.
  • #33 Or maybe the naive version of your script is going to load several GB of data into memory. In that case, you might need to rewrite to use a generator or something.
  • #34 More likely, the database is going to be your big issue.
  • #35 In that case, it’s time to explore some of the features of your ORM. For example, here’s a construct in the Django ORM that I’ve used to bulk create objects instead of creating them one by one. That can be a huge time saver.
  • #36 If you’re still running too slow, you might want to drop down directly to the SQL level and skip Object instantiation and so on. You can get huge performance gains this way.
  • #37 But please, please don’t do these things if you don’t need to. I’ve gotten into PR debates with coworkers who wanted to over-optimize a script that takes 15 minutes to run. Don’t get fancy.
  • #38 So my general rule of thumb is to use the ORM when you can, and use SQL or the equivalent when you need to.
  • #39 Let’s talk about another issue which should be obvious but needs to be said. Backups are your friend. First of all, if it’s at all feasible, set up a snapshot of your data and run your script against that before running it against the real thing.
  • #40 If you’re going to be doing something risky, please back up your data before you execute your script. Right before.
  • #41 If you’re going to be making significant changes rather often, consider automating your database backups so that you snapshot right before your scripts run.
  • #42 There are lots of other considerations, but those are some of the big ones.