Presenting Today
Jon Hyman

CTO & Cofounder, Braze
@jon_hyman
Redis Tames the Caching Herd
Braze empowers you to humanize your brand’s relationships with your customers at scale.
1 trillion data points processed per quarter
1B+ messages sent daily
1.6B monthly active users
It started with an Apdex page against our API at 2:22 AM.
We saw high CPU utilization on an API layer for one of our clusters
Throughput sampled at ~33%, response time was ~5x normal
This computation was taking up most of the API call
Triage
• Our on-call engineer increased the API autoscale group server count per our runbook
• Starting with 57 c4.4xlarge servers, we added capacity to try to resolve the Apdex alert
• Despite adding 123 more servers ($71,669/mo in additional cost!), the Apdex alert did not clear
• Adding more servers actually made things worse
• The API continued to throw errors
Braze in-app messaging architecture
• Braze SDKs have a business rule engine for deciding when to show in-app messages (“IAMs”)
• The client requests IAMs from the API on app open for that session
• The API reads possible IAMs from the database or Memcached
• The API computes IAM target criteria against the user profile and stores the calculated criteria in Memcached with a TTL of 90 seconds
• The API returns the set of possible IAMs to the client device (a sketch of this cache-aside flow appears after the diagram below)
[Diagram: a client device for user 123 requests IAMs from the API servers, which read from the database and the cache]
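To make the flow concrete, here is a minimal cache-aside sketch in Python using pymemcache; the key scheme and the helper functions are hypothetical stand-ins (the slides do not show Braze's actual code), and value serialization is elided.

```python
# Minimal cache-aside sketch (hypothetical names; not Braze's implementation).
# Every API request runs this; when the 90-second entry expires, all concurrent
# requests miss at once and recompute the targeting criteria in parallel.
from pymemcache.client.base import Client as MemcacheClient

memcache = MemcacheClient(("localhost", 11211))
TTL_SECONDS = 90


def load_candidate_iams(app_id):
    """Placeholder: read the app's possible IAMs from the database."""
    raise NotImplementedError


def compute_target_criteria(iams, user_profile):
    """Placeholder: the CPU-intensive targeting computation."""
    raise NotImplementedError


def iams_for_session(app_id, user_profile):
    key = f"iam_targeting:{app_id}"  # hypothetical key scheme
    cached = memcache.get(key)
    if cached is not None:
        return cached

    # Cache miss: hit the database, run the expensive computation,
    # then cache the result for the next 90 seconds.
    iams = load_candidate_iams(app_id)
    result = compute_target_criteria(iams, user_profile)
    memcache.set(key, result, expire=TTL_SECONDS)
    return result
```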
14 seconds to compute?!?!
Happening ~6k times every 90 seconds?!?!
What was going on?
• High volume of API requests (~20,000/second)
• The customer had added many new IAMs with sophisticated targeting rules
• Every 90 seconds, ~6,000 API calls took 14 seconds to complete
• Cache stampede (thundering herd) issue: once the cached entry expired, ~6,000 concurrent requests immediately attempted to repopulate it
• The computation is CPU-intensive
• Of course this won’t scale!
How to fix this? Redis.
Redis cache control
• We used Redis SETNX locks to control refreshing of the cache
• We extended the Memcached TTL to 180 seconds, with a single process refreshing the cache every 90 seconds (see the sketch below)
Full code available at https://github.com/jonhyman/redisconf2019
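As a rough illustration only (assumed names; the actual implementation is in the linked repository), a SETNX-style lock with an expiry can elect a single refresher per 90-second window while the longer 180-second Memcached TTL keeps a still-valid value available to everyone else:

```python
# Hedged sketch of SETNX-lock cache refreshing (assumed names; see the repo
# above for the real code). Value serialization is elided for brevity.
import redis
from pymemcache.client.base import Client as MemcacheClient

r = redis.Redis()
memcache = MemcacheClient(("localhost", 11211))

CACHE_KEY = "iam_targeting:app_123"    # hypothetical key
LOCK_KEY = "refresh_lock:" + CACHE_KEY
CACHE_TTL = 180                        # Memcached TTL extended to 180 seconds
REFRESH_INTERVAL = 90                  # at most one refresher per 90-second window


def compute_iam_targeting():
    """Placeholder for the CPU-intensive targeting computation."""
    raise NotImplementedError


def iams_for_session():
    cached = memcache.get(CACHE_KEY)

    # SET ... NX EX behaves like SETNX with a built-in expiry: the first caller
    # in each 90-second window takes the lock, and a crashed refresher cannot
    # wedge the cache because the lock expires on its own.
    is_refresher = r.set(LOCK_KEY, "1", nx=True, ex=REFRESH_INTERVAL)

    if is_refresher or cached is None:
        # Exactly one request per window recomputes; a completely cold cache is
        # the one case where a non-lock-holder still has to compute.
        value = compute_iam_targeting()
        memcache.set(CACHE_KEY, value, expire=CACHE_TTL)
        return value

    # Everyone else serves the still-valid cached value.
    return cached
```

Because the value is refreshed every 90 seconds but only expires after 180, normal requests should almost never see a cold cache, which is what collapses the ~6,000 concurrent recomputations down to one.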
Success!
API requests loading IAMs dropped to 1 per 90 seconds
Computation now only took 3–4 seconds instead of 14 due to decreased concurrency
Success!
With latency stabilized, we were able to drop back down to 57 API servers
Success!
Thank you! We're hiring!
braze.com/careers
Code available at https://github.com/jonhyman/redisconf2019