4. Who uses Nighthawk?
Some of our biggest customers:
Analytics services - Ads, Video
Ad serving
Ad Exchange
Direct Messaging
Mobile app conversion tracking
5. Design Goals
Scalable: scale vertically and horizontally
Elastic: add / remove instances without violating SLA
High throughput and low latencies
High availability in the event of machine failures
Topology agnostic client
8. Cluster manager
Manages topology membership and changes
- (Re)Balances replicas
- Reacts to topology changes, e.g. a dead node
- Replicated cache - ensures the 2 replicas of the same partition are on separate failure domains
9. Redis databases for partitions
Partition -> Redis DB
Granular key remapping
Logical data isolation
Enumerating - redis db scan
Deletion - flushdb
Enables replica rehydration
[Diagram: keys K1-K4 hashed into Partition X and Partition Y, each backed by its own Redis DB (1 and 2); a minimal code sketch follows.]
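A minimal sketch of the partition-to-Redis-DB mapping in Python with the redis-py client; the partition count, hash function, and single-host layout are illustrative assumptions, not Nighthawk's actual implementation:

import zlib
import redis

# Illustrative only: 16 partitions on one local Redis, one Redis DB per
# partition (a real cluster has far more partitions spread over many Redis
# instances, with the `databases` setting raised accordingly).
NUM_PARTITIONS = 16

def partition_of(key: bytes) -> int:
    """Hash a key to its logical partition."""
    return zlib.crc32(key) % NUM_PARTITIONS

def db_for(pid: int) -> redis.Redis:
    """Each partition maps to its own Redis DB: granular key remapping and
    logical data isolation come from this one-DB-per-partition layout."""
    return redis.Redis(host="localhost", port=6379, db=pid)

def enumerate_partition(pid: int):
    """Enumerating a partition is just a SCAN over its DB."""
    return list(db_for(pid).scan_iter(count=100))

def drop_partition(pid: int):
    """Deleting a partition is a single FLUSHDB; dropping and refilling a DB
    this way is what makes replica rehydration cheap."""
    db_for(pid).flushdb()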
19. High Availability with Replication
Synchronous, best effort
RF = 2, Intra DC
Supports idempotent operations only - get, put, remove, count, scan
Copies of a partition are never on the same host or rack
Passive warming for failed/restarted replicas
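A rough sketch of what a synchronous, best-effort write with RF = 2 could look like at the routing layer, in Python with redis-py; the replica handles and helper name are hypothetical, and only the idempotent operations listed above are fanned out like this:

import redis

# Hypothetical replica handles for one partition: two copies, never on the
# same host or rack (placement is enforced by the cluster manager, not here).
REPLICAS = [
    redis.Redis(host="cache-a.rack1", port=6379),
    redis.Redis(host="cache-b.rack2", port=6379),
]

def put(key: str, value: str) -> bool:
    """Synchronous, best-effort write: try both replicas, succeed if at least
    one write lands. Only idempotent ops (get/put/remove/count/scan) are safe
    to apply to whichever replicas happen to be up, which is why incr/decr
    are not supported."""
    ok = 0
    for replica in REPLICAS:
        try:
            replica.set(key, value)
            ok += 1
        except redis.RedisError:
            pass  # best effort: a failed or warming replica is skipped
    return ok > 0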
20. High Availability with Replication
[Diagram: a client issues GetKey/SetKey for partition 5 to the proxy/routing layer, which uses the topology from the cluster manager to route to the two replicas of partition 5: Backend 0 (partitions 2, 5, 9, SERVING) in Pool A and Backend N (partitions 12, 5, 10) in Pool B. When Backend N fails, a replacement Backend N* comes up with the same partitions in the WARMING state.]
24. Hot Key Mitigation
Server side diagnostics:
Sampling a small % of requests and logging
Post processing the logs to identify high frequency keys
Client side solution:
Client side hot key detection and caching (see the sketch after this list)
Better to have:
Redis tracks the hot keys
Protocol support to send feedback to client if a key is hot
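A minimal sketch of that client-side approach, with hot keys detected from per-window request counts and served from a short-TTL local cache; the threshold, window, and TTL values are made-up parameters:

import time
from collections import Counter

class HotKeyCachingClient:
    """Wraps a cache client; keys requested more than HOT_THRESHOLD times in
    the current window are served from a small local cache with a short TTL."""

    HOT_THRESHOLD = 1000   # requests per window, illustrative
    WINDOW_SECS = 10
    LOCAL_TTL_SECS = 1

    def __init__(self, remote):
        self.remote = remote               # e.g. a redis.Redis instance
        self.counts = Counter()
        self.window_start = time.time()
        self.local = {}                    # key -> (value, expiry)

    def get(self, key):
        now = time.time()
        if now - self.window_start > self.WINDOW_SECS:
            self.counts.clear()            # start a new counting window
            self.window_start = now
        self.counts[key] += 1

        # Serve hot keys from the local cache while the entry is fresh.
        if self.counts[key] > self.HOT_THRESHOLD:
            value, expiry = self.local.get(key, (None, 0))
            if expiry > now:
                return value
            value = self.remote.get(key)
            self.local[key] = (value, now + self.LOCAL_TTL_SECS)
            return value
        return self.remote.get(key)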
25. Active warming of replicas
[Diagram: Backend A in Pool A (partitions 2, 5, 9, SERVING) keeps taking client writes via the proxy/routing layer, while a bootstrapper copies data into Backend B* in Pool B (partitions 12, 5, 10, WARMING), using the topology from the cluster manager.]
Each major service gets its own cache cluster.
2 modes of operation - replicated and non-replicated.
Analytics services - Ads, Video - Ad engagement analytics, video ad engagement analytics
Mobile app conversion tracking - tracks conversions like promoted app installs, in-app purchases and signups
Ad serving - performs ad matching, scoring, and serving
Ad Exchange - real time bidding for ads
DM - direct messaging
Interaction metrics service - provides different types of engagement metrics by tweet or by user
The routing layer subscribes to topology changes and updates its current mapping of partition to Redis node. For every request, it hashes the key to find which partition the key belongs to, looks up which Redis node that partition maps to, and forwards the request to that Redis.
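A minimal sketch of that lookup, assuming a CRC32 hash, a fixed partition count, and a flat partition-to-node map; all of these are illustrative, not Nighthawk's actual scheme:

import zlib
import redis

NUM_PARTITIONS = 1024  # assumed

# Topology published by the cluster manager: partition id -> (host, port, db).
# The routing layer subscribes to changes and swaps this mapping in place.
topology = {
    5: ("backend-0.dc1", 6379, 5),
    # ...
}

def route(key: bytes) -> redis.Redis:
    """Hash the key to a partition, look up which Redis DB the partition maps
    to, and return a connection to forward the request to."""
    pid = zlib.crc32(key) % NUM_PARTITIONS
    host, port, db = topology[pid]
    return redis.Redis(host=host, port=port, db=db)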
Each backend can have one or more Redis instances. Since Redis is single threaded, the backend runs more than one Redis to increase throughput per container and fully utilize the resources allocated to it - bandwidth, CPU, RAM. The backends also have a topology component that announces the currently running Redis nodes.
The cluster manager is in charge of creating partitions and managing topology. It is responsible for balancing replicas of partitions evenly across nodes, ensuring replicas of the same partition are not down at the same time during managed data movement, and removing dead nodes from the topology only after the partitions assigned to them have been successfully reassigned to currently available nodes. It also takes care of rate-limited data movement from current nodes to newly joined nodes, so that clients don't see a huge number of cache misses as soon as the cluster is expanded.
Trade off:
Additional hop in proxy layer - for a topology agnostic client
Runs in mesos containers
Can have 1 or more redis instances running in each container
Number of Redis nodes per container - bound by server resources, the amount of data to be stored, and data density per node.
Announces information about the redis instances running to the topology
Information: DC, host, port, device type, capacity …
Capacity of a node - also referred to as its weight - is how much data it can store.
Watches and reacts to topology changes, like a new replica being assigned to a local Redis, or a replica moving to a remote Redis.
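For illustration, the announcement for each running Redis instance could be modeled as a small record like this; the field names are assumptions based on the information listed above:

from dataclasses import dataclass

@dataclass
class RedisAnnouncement:
    """Illustrative shape of what a backend might announce to the topology
    for each Redis instance it runs (field names are assumptions)."""
    dc: str            # data center
    host: str
    port: int
    device_type: str   # e.g. SSD- vs RAM-backed
    capacity: float    # weight: how much data the node can hold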
Manages all the participants in the topology and maintains the sanity of the cluster
Ensures every partition has a replica residing on an available node
Balances replicas/partitions across nodes of the cluster. If nodes have different capacities, the number of replicas assigned to each node is proportional to its capacity.
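A simplified sketch of that capacity-proportional placement as a greedy heuristic; this is not the cluster manager's actual algorithm, just a way to see how weights drive the assignment:

def balance(replicas, nodes):
    """replicas: list of replica ids; nodes: dict of node -> capacity (weight).
    Greedily place each replica on the node with the lowest load/capacity
    ratio, so higher-capacity nodes end up with proportionally more replicas."""
    assignment = {node: [] for node in nodes}
    for replica in replicas:
        target = min(nodes, key=lambda n: len(assignment[n]) / nodes[n])
        assignment[target].append(replica)
    return assignment

# Example: a node with twice the capacity gets roughly twice the replicas.
print(balance(list(range(12)), {"node-a": 1.0, "node-b": 2.0, "node-c": 1.0}))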
Unit of data movement is much smaller - moving 1/N of the keys in a Redis vs a DB in Redis.
Moving a replica/partition is dropping all keys in a db in one redis and remapping the keys to another db in another redis
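Under the assumptions of the earlier routing sketch, that move might look roughly like this; the topology helpers are hypothetical, and the destination DB is refilled by organic writes or active warming:

import redis

# Hypothetical topology map: partition id -> (host, port, db)
topology = {5: ("backend-a.dc1", 6379, 5)}

def move_partition(pid: int, dst_host: str, dst_port: int, dst_db: int):
    """Moving a replica/partition: drop all keys of the partition's DB on the
    current Redis, then remap the partition to a DB on another Redis. The new
    DB starts empty and is filled by organic writes or active warming."""
    host, port, db = topology[pid]
    redis.Redis(host=host, port=port, db=db).flushdb()    # drop the old copy
    topology[pid] = (dst_host, dst_port, dst_db)          # remap the partition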
Adding new nodes right away causes Count(Keys)/Count(Nodes) keys to get remapped; requests for those keys see a cache miss, hitting the persistent storage hard.
If proper checks and balances exist, persistent storage will rate limit the requests, or just serve with higher latencies and degraded throughput.
In either case, clients will see errors and hit timeouts, and thus undergo success-rate degradation.
There is no intelligent balancing if one Redis node has a higher-spec config, unless you have some sort of balancing logic inside the client. What an overload!
If proxy layer is the bottleneck, you can add more proxy instances.
If backends are the bottleneck, you can add more backends.
Your persistent storage and the storage team will thank you for rate limiting how much traffic you send to it.
State of the partitioning at the end of balancing.
Topology schemes - you could use ZK in combination with consistent hashing, or maintain a changelog to store topology, or move to a totally different method for representing and storing topology.
Clients don’t need to know about it.
CLients don’t have to worry about replication factor, or how replication happens.
New administrative workflows can be added - automating rolling restarts, node maintenance, and migration with the help of the CM.
Why use replication?
Data analytics pipeline
Need to store real-time data that has a relatively short lifetime (until batch jobs catch up)
Computations are expensive to recompute on cache-miss
User session data for current day
Data lifetime of a day
Expensive to store in a persistent key value store for the desired latency/throughput requirements
Serves business goals for half the cost with better latencies.
Trade offs
RF > 2 adds to latency and cost
Non-idempotent operations not supported - incr/decr
Show writes when both are serving.
Hot keys:
Ellen’s tweet is a classic example of how a popular key snowballs into a hotkey.
Key that gets a disproportionately high number of QPS.
Manifests as a very busy cache server, slowing it down further; can result in bandwidth saturation if the value is large, packet drops, and client-side timeouts.
Quickly re-populating a warming replica using a serving copy
Easy solution:
Do nothing, rely on organic population of data on writes
A better solution:
Read data from a serving replica and write to the warming replica
Rate limit copy to not impact production traffic latency and throughput
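A sketch of that rate-limited copy, assuming the serving and warming handles already point at the partition's DB on each replica; the keys-per-second budget is an illustrative parameter:

import time
import redis

def warm_replica(serving: redis.Redis, warming: redis.Redis,
                 max_keys_per_sec: int = 500):
    """Read keys from a serving replica and write them to the warming one,
    sleeping between batches so the copy does not hurt production traffic
    latency or throughput."""
    copied, started = 0, time.time()
    for key in serving.scan_iter(count=100):
        payload = serving.dump(key)
        if payload is None:                 # key expired or deleted mid-scan
            continue
        warming.restore(key, max(serving.pttl(key), 0), payload, replace=True)
        copied += 1
        if copied >= max_keys_per_sec:      # crude rate limit: N keys per second
            time.sleep(max(0.0, 1.0 - (time.time() - started)))
            copied, started = 0, time.time()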