Presenting Today
Jon Hyman

CTO & Cofounder, Braze
@jon_hyman
Redis Tames the Caching Herd
Braze empowers you to humanize your brand’s relationships with your customers at scale.
1 trillion data points processed per quarter
1B+ messages sent daily
1.6B monthly active users
It started with an Apdex page against our API at 2:22 AM.
We saw high CPU utilization on an API layer for one of our clusters
Throughput sampled at ~33%, response time was ~5x normal
This computation was taking up most of the API call
Triage
• Our on-call engineer increased the API autoscale group server count per our runbook
• Starting with 57 c4.4xlarge servers, we added capacity to try to resolve the Apdex alert
• Despite adding 123 more servers ($71,669/mo in additional cost!), the Apdex alert did not clear
• Adding more servers actually made things worse
• The API continued to throw errors
Braze in-app messaging architecture
• Braze SDKs have a business rule engine for deciding when to show in-app messages (“IAMs”)
• The client requests IAMs from the API on app open for that session
• The API reads possible IAMs from the database or Memcached
• The API computes IAM target criteria against the user profile and stores the calculated criteria in Memcached with a TTL of 90 seconds
• The API returns the set of possible IAMs to the client device (a sketch of this cache-aside flow appears after the diagram below)
[Diagram: a client device for user 123 requests IAMs from the API servers, which read from the database and the cache]
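To make the flow concrete, here is a minimal cache-aside sketch in Python using pymemcache; the key scheme and the helper functions are hypothetical stand-ins (the slides do not show Braze's actual code), and value serialization is elided.

```python
# Minimal cache-aside sketch (hypothetical names; not Braze's implementation).
# Every API request runs this; when the 90-second entry expires, all concurrent
# requests miss at once and recompute the targeting criteria in parallel.
from pymemcache.client.base import Client as MemcacheClient

memcache = MemcacheClient(("localhost", 11211))
TTL_SECONDS = 90


def load_candidate_iams(app_id):
    """Placeholder: read the app's possible IAMs from the database."""
    raise NotImplementedError


def compute_target_criteria(iams, user_profile):
    """Placeholder: the CPU-intensive targeting computation."""
    raise NotImplementedError


def iams_for_session(app_id, user_profile):
    key = f"iam_targeting:{app_id}"  # hypothetical key scheme
    cached = memcache.get(key)
    if cached is not None:
        return cached

    # Cache miss: hit the database, run the expensive computation,
    # then cache the result for the next 90 seconds.
    iams = load_candidate_iams(app_id)
    result = compute_target_criteria(iams, user_profile)
    memcache.set(key, result, expire=TTL_SECONDS)
    return result
```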
14 seconds to compute?!?!
Happening ~6k times every 90 seconds?!?!
What was going on?
• High volume of API requests (~20,000/second)
• The customer had added many new IAMs with sophisticated targeting rules
• Every 90 seconds, ~6,000 API calls took 14 seconds to complete
• Cache stampede (thundering herd) issue: once the cached entry expired, ~6,000 concurrent requests immediately attempted to repopulate it
• The computation is CPU-intensive
• Of course this won’t scale!
How to fix this? Redis.
Redis cache control
• We used Redis SETNX locks to control refreshing of the cache
• We extended the Memcached TTL to 180 seconds, with a single process refreshing the cache every 90 seconds (see the sketch below)
Full code available at https://github.com/jonhyman/redisconf2019
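As a rough illustration only (assumed names; the actual implementation is in the linked repository), a SETNX-style lock with an expiry can elect a single refresher per 90-second window while the longer 180-second Memcached TTL keeps a still-valid value available to everyone else:

```python
# Hedged sketch of SETNX-lock cache refreshing (assumed names; see the repo
# above for the real code). Value serialization is elided for brevity.
import redis
from pymemcache.client.base import Client as MemcacheClient

r = redis.Redis()
memcache = MemcacheClient(("localhost", 11211))

CACHE_KEY = "iam_targeting:app_123"    # hypothetical key
LOCK_KEY = "refresh_lock:" + CACHE_KEY
CACHE_TTL = 180                        # Memcached TTL extended to 180 seconds
REFRESH_INTERVAL = 90                  # at most one refresher per 90-second window


def compute_iam_targeting():
    """Placeholder for the CPU-intensive targeting computation."""
    raise NotImplementedError


def iams_for_session():
    cached = memcache.get(CACHE_KEY)

    # SET ... NX EX behaves like SETNX with a built-in expiry: the first caller
    # in each 90-second window takes the lock, and a crashed refresher cannot
    # wedge the cache because the lock expires on its own.
    is_refresher = r.set(LOCK_KEY, "1", nx=True, ex=REFRESH_INTERVAL)

    if is_refresher or cached is None:
        # Exactly one request per window recomputes; a completely cold cache is
        # the one case where a non-lock-holder still has to compute.
        value = compute_iam_targeting()
        memcache.set(CACHE_KEY, value, expire=CACHE_TTL)
        return value

    # Everyone else serves the still-valid cached value.
    return cached
```

Because the value is refreshed every 90 seconds but only expires after 180, normal requests should almost never see a cold cache, which is what collapses the ~6,000 concurrent recomputations down to one.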
Success!
API requests loading IAMs dropped to 1 per 90 seconds
Computation now only took 3–4 seconds instead of 14 due to decreased concurrency
Success!
With latency stabilized, we were able to drop back down to 57 API servers
Success!
Thank you! We're hiring!
braze.com/careers
Code available at https://github.com/jonhyman/redisconf2019