Counting Image Views using Redis Cluster
Seandon Mooy
DevOps Engineer
@erulabs
Or… how I stopped map-reducing and learned to love the stream
3 Billion!
Delay!
Failures!
Also… I may not be the best zookeeper
Challenges with HBase
Roughly 5% of all requests through Thrift were failing… So many tunables!
Optimized timeouts, added circuit breakers, etc.
A trickle of working requests during an outage means circuit breakers are hard to design…
“HBase down == Imgur down”
Downtime == sadtime :(
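The circuit breaker point can be made concrete with a small sketch. It is not the breaker Imgur ran (the slides do not show one). It compares two generic designs under an assumed outage in which every fourth request still succeeds. All class and function names here are hypothetical, and recovery (half-open) logic is left out to keep it short.

```python
from collections import deque


class ConsecutiveBreaker:
    """Opens after `limit` failures in a row; any success resets the count."""

    def __init__(self, limit: int = 5) -> None:
        self.limit = limit
        self.streak = 0

    def record(self, ok: bool) -> None:
        self.streak = 0 if ok else self.streak + 1

    def is_open(self) -> bool:
        return self.streak >= self.limit


class RatioBreaker:
    """Opens when the failure ratio over the last `window` calls exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.5, min_calls: int = 20) -> None:
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_calls = min_calls

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def failure_ratio(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_open(self) -> bool:
        # Require a minimum sample so a couple of early failures do not trip it.
        return len(self.results) >= self.min_calls and self.failure_ratio() > self.threshold


def simulate_trickle(calls: int = 40, every: int = 4):
    """Assumed outage shape: only every `every`-th request succeeds."""
    consecutive, ratio = ConsecutiveBreaker(), RatioBreaker()
    for i in range(calls):
        ok = i % every == 0
        consecutive.record(ok)
        ratio.record(ok)
    return consecutive, ratio
```

With these assumed numbers the streak never exceeds three, so the consecutive-failure breaker stays closed, while the ratio-based one trips. The thresholds are illustrative and would need tuning against real traffic.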
3 Billion!
Solution?
Redis Cluster!
ViewCount V2 - Real time with less complexity!
Fastly
TCP syslog stream
Ingest service
Parses syslog lines, reports metrics via statsd
Redis 3.2 cluster!
HBase Backfill service
Internet
API service
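A minimal sketch of the ingest step, under loud assumptions: the log line layout, the `views:` key prefix, and every name below are hypothetical, since the slides do not show the real Fastly log format or the service code. The store is an in-memory stand-in so the sketch runs without a cluster. A real Redis client with an `incr` method could be passed in instead, and the returned tallies stand in for the metrics that would go to statsd.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical line layout; the real Fastly log format is not given in the slides.
VIEW_RE = re.compile(r"GET /(?P<image_id>[A-Za-z0-9]+)\.(?:jpg|png|gif) (?P<status>\d{3})")


def parse_view(line: str) -> Optional[str]:
    """Return the image ID for a successful image request, else None."""
    match = VIEW_RE.search(line)
    if match is None or match.group("status") != "200":
        return None
    return match.group("image_id")


class InMemoryStore:
    """Stand-in for a Redis client; only mimics INCR and GET on integer counters.

    A real client returns GET values as bytes or str, not int.
    """

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}

    def incr(self, name: str) -> int:
        self.data[name] = self.data.get(name, 0) + 1
        return self.data[name]

    def get(self, name: str) -> Optional[int]:
        return self.data.get(name)


def ingest(lines: Iterable[str], store) -> Tuple[int, int]:
    """INCR one counter per parsed view; return (counted, skipped) for metrics."""
    counted = skipped = 0
    for line in lines:
        image_id = parse_view(line)
        if image_id is None:
            skipped += 1
            continue
        store.incr(f"views:{image_id}")
        counted += 1
    return counted, skipped
```

The API service side is then a GET on the same key, which is the whole "INCR + GET" design in miniature.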
ViewCount V2 - Results:
Request latency:
min: 1ms
max: 16.9ms
median: 1.6ms
p95: 2.6ms
p99: 4.6ms
Codes:
200: 10000
20 billion commands!
> 400GB in memory!
Things to be aware of:
1. Redis Cluster shard maps - redirections, etc.
Monitor redirections - gracefully restart workers after shard moves
2. AOF can slow down / fail large “redis-trib.rb” operations.
Make sure to disable before / re-enable after!
3. Not all legacy systems support Redis Cluster, and if they do…
They might not support it well (PHP-FPM)!
4. Over memory capacity behavior?
Previously we would hard-crash - now we’d LRU old 1-view images.
Neither is good, but for us, one is much less painful
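For item 1, it helps to see where a key lives. Redis Cluster maps every key to one of 16384 hash slots using CRC16 of the key modulo 16384, hashing only the part between the first `{` and the next `}` when that part is non-empty. A client whose slot-to-node map is stale after a shard move gets a MOVED redirection. The sketch below reimplements that mapping for illustration. The function names are my own, and a real client library does this internally.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster hash slot (0..16383) for a key, honoring hash tags."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the braces replaces the hashed portion.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384
```

Counting how often a worker's locally computed owner disagrees with the node that actually answers is one possible way to drive the "monitor redirections" advice. That wiring is left out here.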
ViewCount V3?
Approaching the point of minimal gains for man-hours, but what else might be fun?
1. Moving PHP7 off NodeJS API and directly to Redis Cluster
Downsides: dealing with shard maps is complex in a stateless / process-per-request environment!
2. Using Redis 3.2's BITFIELD or HSET to save on key storage costs
Downsides: complicates the system and raises “hit-by-a-bus” risk - today keys are just hashes, values are just counts!
3. Dealing with the nature of TCP Streams (TCP is not HTTP!)
One connection to rule them all! - Node’s Cluster module helps,
but perhaps Rust or Golang?
Downsides: Vertical scaling is non-obvious on EC2
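Idea 2 could look roughly like the following hash-bucketing sketch, which is an assumption about one way to do it rather than anything shown in the talk. Image IDs are split so that many counters share one Redis hash, written with HINCRBY and read with HGET. The `views:` prefix, the two-character field length, and all names are hypothetical, and the store is an in-memory stand-in.

```python
from typing import Dict, Optional, Tuple


def bucket_for(image_id: str, field_len: int = 2) -> Tuple[str, str]:
    """Split an ID into (hash key, field); the last `field_len` chars become the field."""
    if len(image_id) <= field_len:
        return "views:", image_id  # IDs too short to split share one catch-all bucket
    return f"views:{image_id[:-field_len]}", image_id[-field_len:]


class InMemoryHashes:
    """Stand-in for a Redis client; only mimics HINCRBY and HGET."""

    def __init__(self) -> None:
        self.hashes: Dict[str, Dict[str, int]] = {}

    def hincrby(self, name: str, key: str, amount: int = 1) -> int:
        fields = self.hashes.setdefault(name, {})
        fields[key] = fields.get(key, 0) + amount
        return fields[key]

    def hget(self, name: str, key: str) -> Optional[int]:
        return self.hashes.get(name, {}).get(key)


def record_view(store, image_id: str) -> int:
    """Increment the counter for one image inside its shared bucket."""
    name, field = bucket_for(image_id)
    return store.hincrby(name, field, 1)


def view_count(store, image_id: str) -> int:
    """Read a counter; missing buckets or fields count as zero views."""
    name, field = bucket_for(image_id)
    value = store.hget(name, field)
    return int(value) if value is not None else 0
```

This also illustrates the stated downside: a count no longer sits at a plain key, so anyone debugging has to know the bucketing rule. Any memory saving depends on Redis's compact encoding for small hashes and would need measuring.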
ViewCount V2 - Results:
Redis is:
Faster - Imgur response time decreased ~50ms
Cheaper - EC2 cost reduced by 75%
Simpler - No Java, no MR, no ZK, no third parties, just INCR + GET!
More fun! - I got to talk at RedisConf17!
Acknowledgment
Imgur DevOps Team
Imgur Platform Team
