The Big Picture
Aerospike is a distributed, scalable NoSQL database. It is architected with three key objectives:
To create a flexible, scalable platform that would meet the needs of today’s web-scale applications
To provide the robustness and reliability (i.e., ACID) expected from traditional databases
To provide operational efficiency (minimal manual involvement)
First published in the Proceedings of VLDB (Very Large Databases) in 2011, the Aerospike architecture consists of 3 layers:
The cluster-aware Client Layer includes open source client libraries that implement Aerospike APIs, track nodes and know where data reside in the cluster.
The self-managing Clustering and Data Distribution Layer oversees cluster communications and automates fail-over, replication, cross data center synchronization and intelligent re-balancing and data migration.
The flash-optimized Data Storage Layer reliably stores data in RAM and Flash.
Aerospike scales up and out on commodity servers:
1) Each multi-threaded, multi-core, multi-CPU server is fast.
2) Each server scales up to manage as much as 16 TB of data, with parallel access to up to 20 SSDs.
3) Aerospike scales out linearly as identical servers are added, processing 100 TB+ in parallel across the cluster. (AppNexus ran a 50-node cluster in production, then consolidated to 12 nodes by increasing SSD capacity per node, in preparation for scaling node count and SSD capacity back up.) See http://www.aerospike.com/wp-content/uploads/2013/12/Aerospike-AppNexus-SSD-Case-Study_Final.pdf
Synchronous replication within the cluster provides immediate consistency. The number of replicas can range from 1 to the number of nodes in the cluster; a typical deployment keeps 2 or 3 copies of the data. Asynchronous replication within the cluster is also supported. Across clusters, asynchronous replication is provided by Cross Data Center Replication (XDR): writes are automatically replicated to one or more locations, with fine-grained control, so individual sets or namespaces can be replicated across clusters in master/master/slave complex ring and star topologies.
Row oriented
Aerospike is a row oriented key value store with high throughput and low latency primary key operations.
Fast
Like Redis and Memcache, Aerospike is fast, but it also scales linearly and easily without sharding, so developers can concentrate on business logic rather than the "plumbing". Single-digit-millisecond latencies for primary key operations are routinely achieved across the whole key space.
Complex types
Like Redis and MongoDB, Aerospike supports simple types like Integer and String, and it also supports complex types like List and Map. Complex types can nest inside each other to accommodate documents such as JSON.
Aerospike also supports collections called Large Data Types of Ordered List, Map, Stack and Set. Each element in a large data type is stored in a sub record, allowing these types to store millions of elements with the same access latency regardless of the size of the collection. Large Data Types are semantically equivalent to “infinite columns” in Cassandra.
Queries and Aggregations
Queries on secondary indexes provide a mechanism similar to traditional RDBMS – the “where” clause. These are particularly good for low selectivity queries.
The results of a Query can be piped into a stream that is fed into an Aggregation, this is essentially indexed MapReduce and can be done in near real time.
High performance service
The Aerospike Daemon is an optimized, high performance Linux service. It is multi-core/multi-socket aware and NUMA node aware. Because it is written in C, it does not suffer service interruptions caused by garbage collection.
Client Layer
The Aerospike “smart client” is designed for speed. It is implemented as an open source linkable library available in C, C#, Java, PHP, Go, node.js and Python, and developers are free to contribute new clients or modify them as needed. The Client Layer has the following functions:
Implements the Aerospike API and the client-server protocol, and talks directly to the cluster.
Tracks nodes and knows where data is stored, instantly learning of changes to cluster configuration or when nodes go up or down.
Implements its own TCP/IP connection pool for efficiency. Also detects transaction failures that have not risen to the level of node failures in the cluster and re-routes those transactions to nodes with copies of the data.
Transparently sends requests directly to the node with the data and re-tries or re-routes requests as needed. One example is during cluster re-configurations.
This architecture reduces transaction latency, offloads work from the cluster and eliminates work for the developer. It also ensures that applications do not have to be restarted when nodes are brought up or down. Finally, it eliminates the need to setup and manage additional cluster management servers or proxies.
From Key to Node
Aerospike has a shared nothing architecture, all nodes in a cluster are identical and there is no single point of failure. The cluster scales elastically and linearly as nodes are added or removed.
Data is distributed evenly across nodes using RIPEMD-160.
The distribution algorithm minimizes data movement when nodes come or go and the cluster must be re-balanced. A proprietary Paxos-based consensus algorithm auto-discovers and efficiently manages cluster formation.
Reads and writes continue non-stop during cluster re-organization.
The self-healing architecture includes auto-discovery of cluster changes, auto-failover, auto-rebalancing, background backups/restores and rolling upgrades, ensuring zero-touch, zero-downtime operations.
The only manual intervention required is for physical changes to the cluster, e.g. adding or upgrading hardware.
Cluster Layer
The Aerospike “shared nothing” architecture is designed to reliably store terabytes of data with automatic fail-over, replication and cross data-center synchronization. This layer scales linearly and implements many of the ACID guarantees. It is also designed to eliminate manual operations through the systematic automation of all cluster management functions. It includes 3 modules:
The Cluster Management Module tracks nodes in the cluster. The key algorithm is a Paxos-like consensus voting process which determines which nodes are considered part of the cluster. Aerospike implements a special heartbeat (active and passive) to monitor inter-node connectivity.
When a node is added or removed and cluster membership is ascertained, each node uses a distributed hash algorithm to divide the primary index space into data 'slices' and assign owners. The Data Migration Module then intelligently balances the distribution of data across the nodes of the cluster, ensuring that each piece of data is duplicated across nodes and across data centers, as specified by the system's configured replication factor.
Division is purely algorithmic: the system scales without a master, eliminating the additional configuration that a sharded environment requires.
The Transaction Processing Module reads and writes data as requested and provides many of the consistency and isolation guarantees. This module is responsible for:
Sync/Async Replication: for writes with immediate consistency, it propagates changes to all replicas before committing the data and returning the result to the client.
Proxy: in rare cases during cluster re-configurations, when the Client Layer may be briefly out of date, it transparently proxies the request to another node.
Duplicate Resolution: when a cluster is recovering from being partitioned, it resolves any conflicts that may have occurred between different copies of data. Resolution can be configured to be
Automatic, in which case the data with the latest timestamp is canonical
User driven, in which case both copies of the data can be returned to the application for resolution at that higher level.
Once you have the first cluster up, you can optionally install additional clusters in other data centers and setup cross data-center replication – this ensures that if your data center goes down, the remote cluster can take over the workload with minimal or no interruption to users.
Data Distribution
Aerospike Database has a Shared-Nothing architecture: every node in an Aerospike cluster is identical, all nodes are peers and there is no single point of failure.
Data is distributed evenly across the nodes in a cluster using the Aerospike Smart Partitions™ algorithm. The algorithm relies on a hash function with very uniform output, which keeps the partition distribution even to within a 1-2% margin.
To determine where a record should go, the record key (of any size) is hashed into a 20-byte fixed-length digest using RIPEMD-160, and the first 12 bits form a partition ID which determines which of the 4096 partitions should contain this record. The partitions are distributed equally among the nodes in the cluster, so if there are N nodes in the cluster, each node stores approximately 1/N of the data.
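The key-to-partition mapping described above can be sketched in a few lines of Python. This is an illustrative sketch, not the server's code: exactly which 12 bits of the digest Aerospike uses is an internal detail, and the SHA-1 fallback exists only because some Python/OpenSSL builds do not expose RIPEMD-160.

```python
import hashlib

PARTITIONS = 4096  # the fixed partition count described above

def record_partition(key: bytes) -> int:
    """Map a record key of any size to one of 4096 partitions."""
    try:
        # Hash the key into a fixed 20-byte digest with RIPEMD-160.
        digest = hashlib.new("ripemd160", key).digest()
    except ValueError:
        # Some OpenSSL builds lack RIPEMD-160; SHA-1 (also 20 bytes)
        # keeps the sketch runnable and illustrates the same idea.
        digest = hashlib.sha1(key).digest()
    # Take the first 12 bits of the digest as the partition ID (0..4095).
    return int.from_bytes(digest[:2], "big") >> 4

pid = record_partition(b"user:1234")
assert 0 <= pid < PARTITIONS
```

Because the digest is effectively uniform, keys land on partitions evenly regardless of the shape of the original keys, which is what makes the 1/N-per-node distribution work.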
Because data is distributed evenly (and randomly) across nodes, there are no hot spots or bottlenecks where one node handles significantly more requests than another node. The random assignment of data ensures that server load is balanced.
For reliability, Aerospike replicates partitions on one or more nodes. One node becomes the data master for reads and writes for a partition and other nodes store replicas.
Synchronous replication ensures immediate consistency and no loss of data. A write transaction is propagated to all replicas before the data is committed and the result returned to the client. In rare cases during cluster re-configurations, when the Aerospike Smart Client™ may have sent the request to the wrong node because it is briefly out of date, the Aerospike Smart Cluster™ transparently proxies the request to the right node. Finally, when a cluster is recovering from being partitioned, it resolves any conflicts that may have occurred between different copies of data. Resolution can be configured to be automatic, in which case the data with the latest timestamp is canonical, or both copies of the data can be returned to the application for resolution at that higher level.
Immediate consistency – by default
The master node for a record coordinates with the replica node(s) before committing the write to RAM. The writes are then asynchronously flushed to storage on the master and replica nodes. As an option, administrators can configure the master to write to the replica asynchronously, but this is a very uncommon case.
Aerospike implements efficient row-level locking (latches) for reads and writes. By default, in the most common deployment mode, Aerospike implements synchronous replication (immediate consistency) across multiple copies of the same data records within a single cluster. The Aerospike partitioning and clustering algorithms use automatic locking and latching to ensure consistency of the data across shards, partitions and databases.
A node is added to the cluster
When a node is added to the cluster, to provide increased capacity or throughput, the existing nodes use the heartbeat to locate the new node. The cluster then begins the process of data migration so that the new node becomes data master for some partitions and replica for others. The client also detects these changes automatically, and future requests are sent to the correct node.
What happens when a node is removed?
If a node has a hardware failure, or is being upgraded, the inverse happens. In a 4-node cluster, one node is the data master for about 1/4 of the data; when it leaves, those partitions exist only as replicas on the remaining nodes. The remaining nodes automatically perform data migration to copy the replica partitions to new data masters.
At the same time, your application, through the Smart Client™, becomes aware that the node has failed, and the client automatically calculates the new partition map.
Automatic rebalancing
The data rebalancing mechanism ensures that query volume is distributed evenly across all nodes, and is robust even if a node fails during rebalancing itself. The system is designed to be continuously available, so data rebalancing doesn't impact cluster behavior. The transaction algorithms are integrated with the data distribution system, and there is only one consensus vote to coordinate a cluster change. With only one vote, there is only a short period during which the cluster's internal redirection mechanisms are used while clients discover the new cluster configuration. Thus, this mechanism optimizes transactional simplicity in a scalable shared-nothing environment while maintaining ACID characteristics.
By not requiring operator intervention, the cluster will self-heal even at demanding times. The capacity planning and system monitoring provide you the ability to handle virtually any unforeseen failure with no loss of service. You can configure and provision your hardware capacity and set up replication/synchronization policies so that the database recovers from failures without affecting users.
Data migration can be tuned to meet your operational requirements; see http://www.aerospike.com/docs/operations/tune/migration/
Clustering
Aerospike uses distributed processing to ensure data reliability. In theory, many databases could get all of the required throughput from a single server with a huge SSD. However, this would not provide any redundancy, and if the server went down, all database access would stop. So a more typical configuration is several nodes, each with several SSD devices. Aerospike typically runs on fewer servers than other databases.
The distributed, shared-nothing architecture means that nodes can self-manage and coordinate to ensure that there are no outages to the user, while at the same time being easy to expand your cluster as traffic increases.
Heartbeat
The nodes in the cluster keep track of each other through a heartbeat so that they can coordinate among themselves. The nodes are peers: no single node is the master, and every node tracks the other nodes in the cluster. When nodes are added or removed, the other nodes in the cluster detect it via the heartbeat mechanism. Aerospike has two ways of defining a cluster:
Multicast: a multicast IP:PORT is used to broadcast heartbeat messages.
Mesh: the address of one Aerospike server is used to join the cluster.
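The two modes correspond to the heartbeat stanza in the Aerospike configuration file. The sketch below is illustrative only: the addresses, ports and timing values are placeholder assumptions, not recommendations, and directive names can vary between server versions.

```
network {
    heartbeat {
        mode mesh                               # or: multicast
        port 3002
        mesh-seed-address-port 10.0.0.1 3002    # address of one node used to join the cluster
        interval 150                            # ms between heartbeats
        timeout 10                              # missed heartbeats before a node is considered gone
    }
}
```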
High speed distributed consensus
Aerospike's 24/7 reliability guarantee starts with its ability to detect cluster failure and quickly recover from that situation to reform the cluster.
Once the cluster change is discovered, all surviving nodes use a Paxos-like consensus voting process to determine which nodes form the cluster and Aerospike Smart Partitions™ algorithm to automatically re-allocate partitions and re-balance. The hashing algorithm is deterministic – that is, it always maps a given record to the same partition. Data records stay in the same partition for their entire life although partitions may move from one server to another.
All of the nodes in the Aerospike system participate in a Paxos distributed consensus algorithm, which is used to ensure agreement on a minimal amount of critical shared state. The most critical part of this shared state is the list of nodes that are participating in the cluster. Consequently, every time a node arrives or departs, the consensus algorithm runs to ensure that agreement is reached.
This process takes a fraction of a second. After consensus is achieved each individual node agrees on both the participants and their order within the cluster. Using this information the master node for any transaction can be computed along with the replica nodes.
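The computation in the last step can be sketched as follows. The rotation rule used here is an illustrative stand-in for Aerospike's actual assignment function; the point is that, given the agreed node order, every node derives the same master and replicas for a partition with no further messaging.

```python
def succession(nodes: list, pid: int, replication_factor: int = 2) -> list:
    """Return [master, replica, ...] for a partition, given the agreed
    node ordering. Every node computes the same answer locally."""
    start = pid % len(nodes)                 # deterministic starting point
    rotated = nodes[start:] + nodes[:start]  # rotate the agreed order
    return rotated[:replication_factor]

nodes = ["A", "B", "C", "D"]                 # the post-consensus ordering
master, replica = succession(nodes, pid=4093)
```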
Management and Monitoring
Aerospike provides a sophisticated management console, the Aerospike Management Console (AMC). Through AMC you can manage and monitor every aspect of your cluster. Plugins are also available for Graphite and Nagios. You can also "roll your own" tooling with the comprehensive "Info" APIs.
Handling Traffic Saturation
A detailed discussion of the network hardware required to handle peak traffic loads is beyond the scope of this course. Aerospike Database provides you with monitoring tools that let you identify bottlenecks. If the network is the bottleneck, the database will not run at capacity and requests will be slow.
Handling Capacity Overflows
We have many suggestions for capacity planning, managing your storage and monitoring the cluster to ensure that your storage doesn't overflow. If it does, Aerospike reaches its stop-writes limit: no new records are accepted, but data updates and reads continue to be processed.
In other words, even beyond optimal capacity, the database does not stop handling requests, it continues to do as much as possible to keep processing user requests.
Primary Index
In an operational database, the fastest and most predictable index is the primary key index. This index provides the most predictable and fastest access to row information in a database.
Aerospike's primary key index is a blend of distributed hash table technology with a distributed tree structure in each server. The entire keyspace in a set (table) is partitioned using a robust hash function into partitions. There are 4096 partitions in total, equally distributed across the nodes in the cluster. Please see data-distribution for details of hashing and partitioning.
At the lowest level, an in-memory red-black tree structure is used, similar to the data structures used in the popular Memcache system.
The primary index is on the 20-byte hash (or digest) of the specified primary key. While this expands the key size of some records (which might have, for example, a unique 8-byte key), it is beneficial because the code behaves predictably regardless of the input key size or distribution.
When a single server fails, the indexes on a second server are immediately available. If the failed server remains down, data starts rebalancing. As the data rebalances, the index is computed on the new server as the data arrives.
Index Metadata
Each index entry currently requires 64 bytes. Along with the digest, the following metadata is also stored in the index:
Write generation (or vector clock): tracks all the updates to the key in the system, and is used when resolving conflicting updates.
Time to live: tracks the life of a key in the system. This is the time at which the key should expire, and is used by the eviction subsystem.
Storage address: the location of the data in storage (both in-memory and persistent).
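As a sketch, a 64-byte index entry can be pictured as the following record. The field names and Python representation are illustrative; the server packs these fields into a C structure, not shown in this document.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    digest: bytes         # 20-byte RIPEMD-160 digest of the primary key
    generation: int       # write generation / vector clock, for conflict resolution
    ttl: int              # time at which the key expires; used by the evictor
    storage_address: int  # where the record lives (RAM offset or device block)

entry = IndexEntry(digest=b"\x00" * 20, generation=3, ttl=1700000000,
                   storage_address=0x4000)
```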
Index Persistence
To maintain throughput, primary indexes are not committed to storage - only to RAM. This allows writes to be very high performance, and highly parallel. The data storage layer may also be configured not to use storage.
When the Aerospike server starts, it walks through the data on storage and creates the primary index for all of the data partitions.
Fast Restart
In order to support fast upgrades of clusters with no downtime, Aerospike supports a feature called "fast restart". The index entries are allocated from slabs of memory. When "fast restart" is configured, this index memory is allocated from a Linux shared memory segment. In the case of a planned shutdown and restart of Aerospike (e.g. upgrades), on restart the server simply re-attaches to the shared memory segment and brings the primary indexes live. This does not require any scan of the data in storage.
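The shared memory trick can be demonstrated with Python's multiprocessing.shared_memory. This is only an analogy for the mechanism: the segment name, size and contents are made up, and Aerospike's implementation uses its own allocator on top of Linux shared memory.

```python
from multiprocessing import shared_memory

NAME = "as_fast_restart_demo"

# Clean up any stale demo segment from a previous run.
try:
    shared_memory.SharedMemory(name=NAME).unlink()
except FileNotFoundError:
    pass

# "First boot": allocate index memory as a named shared memory segment.
seg = shared_memory.SharedMemory(name=NAME, create=True, size=64)
seg.buf[:4] = b"IDX1"  # stand-in for primary index entries

# "Restarted server": attach by name -- the index bytes are still there,
# so no scan of the data on storage is needed.
seg2 = shared_memory.SharedMemory(name=NAME)
restored = bytes(seg2.buf[:4])

seg2.close()
seg.close()
seg.unlink()  # demo cleanup; a real fast restart would keep the segment
```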
Single bin integer optimization
Storing a key-value entry where the value is a single integer can be very useful. When the "single bin" feature is enabled on a namespace (providing lower memory use than supporting the full Aerospike bin structure on each record), a further optimization is available. When the stored value is an integer, the space which would point to the bin information is reused to hold the integer itself. For this special case, the amount of storage required for integer values is simply the space required for the index.
Secondary Index
Secondary indexes are on non-primary-key values, which provides the ability to model one-to-many relationships. Indexes are specified on a bin-by-bin basis (like columns in an RDBMS). This allows efficient updates and minimizes the amount of resources required to store the indexes.
A data description language (DDL) is used to determine which bins, and of which type, are to be indexed. Indexes can be created and removed dynamically through the provided tools or the API. This DDL, which appears similar to an RDBMS schema, is NOT used for data validation: even if a bin is listed in the DDL as indexed, that bin is optional. For an indexed bin, updating the record to include the bin updates the index.
For example, an index can be created only on strings OR integers. Consider the case where you have a bin for storing a user's age and the age value is stored as a string by one application and an integer by another application. An integer index will exclude records which contain string values stored in the indexed bin while a string index would exclude records with integer values stored in the indexed bin. Secondary indexes are:
Stored in RAM for fast look-up
Built on every node in the cluster and collocated with the primary index. Each secondary index entry contains references to records local to the node only.
Secondary indexes contain pointers to both master records and replicated records in a cluster.
Data Structure
Like the Aerospike primary index, the secondary index is a blend of a hash table and B-trees: structurally, a hash table of B-trees whose entries in turn point to B-trees of record references. Each logical secondary index has 32 physical trees. When a key is identified for an index operation, a hash function determines which of the 32 physical trees should hold the secondary index entry, and the update is then made into that tree. Note that a single secondary index key can reference many primary records, so the lowest level of this structure is a list of full-fledged B-trees, sized according to the number of records that need to be stored.
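A toy version of this layout, with ordered lists standing in for B-trees, may make the shape clearer. The tree count (32) comes from the text; everything else is illustrative.

```python
import bisect

N_TREES = 32
trees = [[] for _ in range(N_TREES)]  # each list stands in for a physical B-tree

def index_insert(index_key, record_digest):
    # A hash of the indexed value picks 1 of the 32 physical trees...
    tree = trees[hash(index_key) % N_TREES]
    # ...and the entry is kept in sorted order, as a B-tree would.
    bisect.insort(tree, (index_key, record_digest))

index_insert("age=31", "digest-1")
index_insert("age=31", "digest-2")  # one index key can reference many records
```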
Index Creation
Aerospike supports dynamic creation of secondary indexes. Tools such as aql are available which read the indexes currently available, and allow creation and destruction of indexes.
To build a secondary index, users need to specify a namespace, set, bin and type of the index (e.g., integer, string, etc.).
On receiving confirmation from the System Metadata (SMD) subsystem, each node creates the secondary index in write-active mode and starts a background scan job, which scans all of the data and inserts entries into the secondary index.
Index entries are only created for records that match all of the index specifications.
The scan job that populates the secondary index interacts with read/write transactions in exactly the same way a normal scan does, except that there is no network component to the index creation scan, unlike a normal scan. During the index creation period, all new writes that affect the indexing attribute will update the index.
As soon as the index creation scan is completed and all of the index entries are created, the index is immediately ready for use by queries and is marked read-active.
The secondary index is available for normal querying once it has been created successfully on all of the nodes.
User Defined Function
In SQL databases, a user-defined function provides a mechanism for extending the functionality of the database server by adding a function that can be evaluated in SQL statements.
UDFs allow compute to move close to the data: the function is run on the same server where the data resides.
Data Storage Layer
Aerospike is a key-value store with schema-less data model. Data is organized into policy containers called namespaces, semantically similar to databases in an RDBMS system. Within a namespace, data is subdivided into sets (similar to tables) and records (similar to rows). Each record has an indexed key that is unique in the set, and one or more named bins (similar to columns) that hold values associated with the record.
Sets and bins do not need to be defined up front, but can be added during run-time for maximum flexibility.
Values in bins are strongly typed, and can include any of the supported data types. Bins themselves are not typed, so different records could have the same bin with values of different types.
Indexes (primary keys and secondary keys) are stored in RAM for ultra-fast access, and values can be stored either in RAM or, more cost-effectively, on SSDs. Each namespace can be configured separately, so small namespaces can take advantage of DRAM while larger ones gain the cost benefits of SSDs.
The Data Layer was particularly designed for speed and a dramatic reduction in hardware costs. It can operate all in-memory, eliminating the need for a caching layer or it can take advantage of unique optimizations for flash storage. In either case, data is never lost.
100 Million keys take up only 6.4GB. Although keys have no size limitations, each key is efficiently stored in just 64 bytes.
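The 6.4 GB figure follows directly from the 64-byte index entry size:

```python
entries = 100_000_000        # 100 million keys
bytes_per_entry = 64         # fixed primary index entry size
total_gb = entries * bytes_per_entry / 1e9
assert total_gb == 6.4       # 100M keys -> 6.4 GB of index RAM
```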
Native, multi-threaded, multi-core Flash I/O and an Aerospike log structured file system take advantage of low level SSD read and write patterns. In addition, writes to disk are performed in large blocks to minimize latency. This mechanism bypasses the standard file system, historically tuned to rotational disks.
Also built in are a Smart Defragmenter and an Intelligent Evictor. These processes work together to ensure that there is space in RAM, and that data is safely written to disk and never lost.
The Defragmenter tracks the number of active records in each block and reclaims blocks that fall below a minimum level of use.
The Evictor removes expired records and reclaims memory if the system gets beyond a set high-water mark. Expiration times are configured per namespace; the age of a record is calculated from the last time it was modified, and the application can override the default lifetime at any time and specify that a record should never be evicted.
Hybrid Storage
The Hybrid Memory System holds the indexes and data stored in each node, and handles interactions with the physical storage. It also contains modules that automatically remove old data from the database, and defragment the physical storage to optimize disk usage.
Aerospike can store data in RAM or on Flash (SSD), and each namespace can be configured separately. This configuration flexibility allows an application developer to put a small, frequently accessed namespace in RAM, but put a larger namespace in less expensive storage such as SSDs.
Significant work has been done to optimize data storage on SSDs, including bypassing the file system to take advantage of low-level SSD read and write patterns.
Philosophy
Other than Large Data Types, all the data for a record is stored together. The amount of storage for each row is limited by default to 1 MB in size, which can be increased.
Storage is copy-on-write, with free space being reclaimed by the defragmentation process.
Each namespace is configured with a fixed amount of storage. Every node must have the same namespaces, and the same amount of storage for each namespace.
Storage can be configured with pure RAM without persistence, RAM with storage persistence, or using Flash storage (SSDs).
Eviction based on storage
The Defragmenter tracks the number of active records on each block on disk, and reclaims blocks that fall below a minimum level of use. The Evictor is responsible for removing references to expired records, and for reclaiming memory if the system gets beyond a set high water mark. When configuring a namespace, the administrator specifies the maximum amount of RAM used for that namespace, as well as the default lifetime for data in the namespace. Under normal operation, the Evictor looks for data that has expired, freeing the index in memory and releasing the record on disk. The Evictor also tracks the memory used by the namespace, and releases older, although not necessarily expired, records if the memory exceeds the configured high water mark. By allowing the Evictor to remove old data when the system hits its memory limitations, Aerospike can effectively be used as an LRU cache. Note that the age of a record is measured from the last time it was modified, and that the Application can override the default lifetime any time it writes data to the record. The Application may also tell the system that a particular record should never be automatically evicted.
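The evictor's policy can be sketched as follows. The record layout, field names and thresholds are illustrative; only the policy itself (expired records always go, and the oldest records go next once the high-water mark is exceeded, unless flagged never-evict) comes from the text above.

```python
def run_evictor(records, used_bytes, high_water_bytes, now):
    """Return (surviving records, resulting memory use).

    records: dicts with 'ttl' (expiry time or None), 'modified' (last
    write time), 'size' (bytes) and 'never_evict' (application flag).
    """
    survivors = []
    for rec in sorted(records, key=lambda r: r["modified"]):  # oldest first
        if rec["ttl"] is not None and rec["ttl"] <= now:
            used_bytes -= rec["size"]   # expired: always removed
        elif used_bytes > high_water_bytes and not rec["never_evict"]:
            used_bytes -= rec["size"]   # over the high-water mark: evict oldest
        else:
            survivors.append(rec)
    return survivors, used_bytes

recs = [
    {"ttl": 50,   "modified": 1, "size": 10, "never_evict": False},  # expired
    {"ttl": None, "modified": 2, "size": 10, "never_evict": False},  # old
    {"ttl": None, "modified": 3, "size": 10, "never_evict": False},  # recent
]
left, used = run_evictor(recs, used_bytes=30, high_water_bytes=15, now=100)
```

Letting the evictor release the oldest not-yet-expired records once memory is tight is exactly what lets Aerospike act as an LRU cache.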
Data in RAM
Data purely in RAM - without persistence - has the benefits of higher throughput. Although modern Flash storage is very high performance, RAM is even higher performance, and prices of RAM are dropping.
Data is allocated through the JEMalloc allocator. JEMalloc allows allocation into different pools. Long term allocations - such as those for the storage layer - can be allocated separately. We have found the JEMalloc allocator has exceptional properties in terms of low fragmentation.
By using multiple copies of RAM, very high levels of reliability can be obtained. Since Aerospike automatically re-shards and replicates data on failure or cluster node addition, a high level of "k-safety" can be obtained. Bringing a node back online automatically populates its data from one of the copies.
Due to Aerospike's random data distribution, data unavailability when several nodes are lost can be quite small. For example, in a 10-node cluster with two copies of the data, if two nodes are lost in quick succession, the amount of data unavailable before re-replication completes is only about 2% (roughly 1/50 of the data).
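The arithmetic behind this example, under the simplifying assumption that the two copies of each partition are placed independently and uniformly:

```python
nodes = 10

# A partition becomes unavailable only if BOTH of its copies were on the
# two failed nodes: the master on one of them (2/10), and the replica on
# the other, out of the remaining 9 nodes (1/9).
unavailable = (2 / nodes) * (1 / (nodes - 1))
# ~2.2%, in line with the rough "about 2%" figure in the text
```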
When a persistence layer is configured, reads always occur from the RAM copy. Writes occur through the data path as described below.
Data on SSD / Flash
When data is to be written, a write latch is taken for the row, to avoid two conflicting writes on the same record. In some cluster states, data may need to be read from other nodes and conflicts resolved.
After the write is validated, the in-memory representations of the record are updated on the master. The data that will be written is added to a write buffer. If the write buffer is full, it is queued to disk. Depending on the size of the write buffer - which is the same as the maximum row size - and the throughput of writes, there will be a certain risk of uncommitted data.
If there are replicas, they are updated, and their in-memory indexes are updated as well. After data has been updated on all copies in-memory, the result is returned to the client.
The system can be configured to return a result to the client before all writes complete (delayed consistency).
Flash optimizations
The Defragmenter tracks the number of active records on each block on disk, continually scanning the available blocks and reclaiming any block whose level of use falls below a minimum threshold (that is, any block which is more than a certain amount free).
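The bookkeeping involved can be sketched as follows. The 50% threshold and block layout are illustrative placeholders; in practice the defragmentation threshold and rate are configurable.

```python
def find_defrag_candidates(blocks, low_water_mark=0.5):
    """blocks maps block_id -> (live_records, total_records).
    Returns the blocks whose live fraction fell below the threshold;
    their live records get rewritten into the current write block and
    the block is then reclaimed."""
    return [block_id
            for block_id, (live, total) in blocks.items()
            if total > 0 and live / total < low_water_mark]

blocks = {0: (10, 100), 1: (80, 100), 2: (49, 100)}
candidates = find_defrag_candidates(blocks)  # blocks 0 and 2 are mostly dead
```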
Wear Leveling
Writing to flash in large, even blocks spreads write activity across the device, which works with (rather than against) the SSD's internal wear-leveling and reduces write amplification.
Block size
Writes are buffered and flushed to the device in large blocks rather than as small random writes, matching the write characteristics of flash.
Copy on write
Records are never updated in place: each update is written to a new location, and the old copy is reclaimed later by defragmentation.
Continuous background defrag
Defragmentation runs continuously in the background, so free space is reclaimed steadily without introducing latency spikes.
Even distribution across devices
No RAID
Reading a record
The entire record is read from storage into the server's RAM; only the requested bins are returned to the client.
The process:
Client finds Master Node from partition map.
Client makes read request to Master Node.
Master Node finds data location from index in RAM.
Master Node reads entire record from flash. (This is true even if only a single bin is being read.)
Master Node returns value.
Writing a new record
Entries are added into the primary and any secondary indexes. The record is placed in a write buffer to be written to the next available block.
The process:
Client finds Master Node from partition map.
Client makes write request to Master Node.
Master Node makes an entry in the index (in RAM) and queues the write in a temporary write buffer.
Master Node coordinates write with replica nodes (not shown).
Master Node returns success to client.
Master Node asynchronously writes data in blocks.
Index in RAM points to location on flash.
Updating a record
The record is read into server RAM, updated as needed, then written to a new block (copy on write).
The index(es) are updated to point to the new location. The storage space of the old record is recovered at some future time by the automatic background defragmentation.
The process:
Client finds Master Node from partition map.
Client makes update request to Master Node.
Master Node reads the existing record.
Master Node queues the write of the updated record in a temporary write buffer.
Master Node coordinates write with replica nodes (not shown).
Master Node returns success to client.
Master Node asynchronously writes data in blocks.
Index in RAM points to new location on flash.
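The copy-on-write update can be sketched as below. This is an illustrative simulation with invented names, showing why the old copy lingers on flash until defragmentation reclaims it:

```python
# Illustrative sketch of copy-on-write updates: the modified record is
# written to a NEW location, the index is repointed, and the old block's
# live-byte count drops so defragmentation can reclaim it later.

def update(index, flash, live_bytes, key, changes):
    old_loc = index[key]
    record = dict(flash[old_loc])        # read existing record into RAM
    record.update(changes)               # apply the changed bins
    new_loc = max(flash) + 1             # next free location (simplified)
    flash[new_loc] = record              # write to a new block, no overwrite
    index[key] = new_loc                 # index now points at the new copy
    live_bytes[old_loc] = 0              # old copy is dead; defrag reclaims

index = {"k": 0}
flash = {0: {"count": 1}}
live_bytes = {0: 64}
update(index, flash, live_bytes, "k", {"count": 2})
assert index["k"] == 1
assert flash[1] == {"count": 2}
assert flash[0] == {"count": 1}      # stale copy still on flash for now
assert live_bytes[0] == 0            # eligible for defragmentation
```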
Deleting a record in Flash
Deleting a record removes the entries from the indexes only, making deletes very fast.
The background defragmentation will physically delete the record at a future time.
The process:
Client finds Master Node from partition map.
Client makes delete request to Master Node.
Master Node removes the entry from the index(es).
Master Node coordinates deletion with replica nodes (not shown).
Master Node returns success to client.
Data is eventually deleted from flash by defragmentation.
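The index-only delete can be sketched in a few lines. An illustrative simulation with invented names:

```python
# Illustrative sketch: a delete only drops the index entry; the record's
# bytes stay on flash until background defragmentation reclaims the block.

def delete(index, key):
    del index[key]        # metadata-only operation; no flash I/O

index = {"k": 7}
flash = {7: {"name": "Ada"}}
delete(index, "k")
assert "k" not in index        # record is no longer reachable
assert 7 in flash              # bytes persist until defrag runs
```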
XDR Architecture
XDR - Cross Datacenter Replication - is an Aerospike feature which allows one cluster to synchronize with another cluster over a higher-latency link. This replication is asynchronous, but delay times are often under a second. Each write is logged, and that log is used to replicate to remote clusters. Each master for a partition is responsible for replicating that partition's writes, and the current write status is replicated to the local cluster's partition replica, so when that replica is promoted, it can take over where the previous master left off.
An XDR process runs on every node of the cluster alongside the Aerospike process. The main task of the XDR process is to monitor database updates on the "local" cluster and send copies of all write requests to one or more "remote" destination clusters. XDR determines the remote cluster details from the main Aerospike configuration files. It is possible to ship different namespaces (or even sets) to different clusters. In addition, the XDR process takes care of failure handling. Failures can be local node failures, link failures to the remote cluster, or both.
Each node runs:
Aerospike daemon process (asd)
XDR daemon process (xdr), which is composed of the following:
Logger module
Shipper module
Failure Handler module
An XDR process runs on every node of the cluster. All the writes/updates that happen on each node in the local cluster need to be forwarded to one or more remote clusters. In order to do this efficiently, XDR uses a log of all digests of the keys that need to be shipped. When XDR writes to the remote cluster, it will get the current value of the key at that time. This is done so that if there are multiple changes to the same record, only the latest value is sent; intermediary values are not. XDR does not store any values of the record in the digest log.
In detail: writes and updates by the main database process are communicated to the XDR process over a named pipe. The Logger module receives this communication and stores it in the digest log. To keep the digest log small, only minimal information (the key digest) is stored - just enough to ship the actual records at a later time. The digest log is a ring buffer file: if the buffer is full, the most recent data overwrites the oldest. This bounds the file size, but it also introduces a capacity planning issue. You must configure a digest log large enough to avoid losing any pending data that has yet to be shipped, and you should plan for long-term outages in the links between your data centers - for instance, a data center going down due to weather or an earthquake.
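The digest log's behavior can be sketched as a ring buffer. This is an illustrative simulation (the real log stores key digests, which the sketch stands in for with the key itself); it shows why only the latest value of a repeatedly updated record is shipped, and why a full buffer overwrites the oldest pending entries:

```python
# Illustrative sketch of the digest log: a fixed-size ring buffer of key
# digests (no values). At ship time XDR reads the record's CURRENT value,
# so several updates to one key collapse into one shipment. If writes
# outpace shipping, the oldest unshipped digests are overwritten - the
# capacity-planning risk described above.

class DigestLog:
    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0                    # next slot to write

    def log(self, digest):
        self.buf[self.head] = digest     # may overwrite the oldest entry
        self.head = (self.head + 1) % len(self.buf)

    def ship(self, local_db, remote_db):
        pending = {d for d in self.buf if d is not None}  # dedupe digests
        for d in pending:
            if d in local_db:
                remote_db[d] = local_db[d]   # ship the latest value only

log = DigestLog(slots=4)
local, remote = {}, {}
for value in (1, 2, 3):                  # three updates to the same key
    local["key-a"] = value
    log.log("key-a")                     # key used as its own "digest"
log.ship(local, remote)
assert remote == {"key-a": 3}            # intermediate values never shipped
```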
XDR Topologies
XDR can be configured with various topologies such as simple active-active, active-passive, star, and combinations of these. The star topology allows one data center to simultaneously replicate data to multiple data centers. This complements Aerospike’s support for complex multi-master ring replication topologies, giving companies unprecedented flexibility to implement the replication approach that best supports their needs.
Simple Active-Passive
In the active-passive topology, writes initially happen at only one cluster. The remote cluster is a stand-by cluster that can be used for reads. The system does not have a mechanism to stop application (non-XDR) writes on the remote cluster, so the application should not write any data directly to the destination cluster. XDR shipping on the destination cluster should be disabled using the config option "enable-xdr" so that any inadvertent writes on the remote cluster are not shipped out to other clusters. This topology is useful when intense analysis of data is needed but you do not want to affect performance on the main cluster: the main cluster is configured as active and the analysis cluster as passive.
Simple Active-Active
In the active-active topology, writes can happen at both clusters, and writes on one cluster are forwarded to the other. XDR ensures that shipped data is not forwarded further, avoiding an infinite loop of writes bouncing between clusters. Because writes are allowed on both clusters, the same key can be updated on both at the same time, leading to inconsistent data. As the Aerospike database is agnostic to the application data, it cannot auto-correct this inconsistency. The window in which inconsistency can arise spans from the time of the first write to the time it is shipped and rewritten on the second cluster: a write to the same record on the remote cluster within this window produces an inconsistency the server cannot resolve automatically. A subsequent update will, however, force the data back into sync, assuming the concurrent-write pattern does not repeat. Active-active is therefore unsuitable where the same record may be updated in two clusters at the same time, but it works very well when primary writes to a record are strongly associated with a single cluster and the other clusters act as a hot backup.
For example, a company has users spread across North America. Traffic is divided between the West Coast and East Coast data centers. While it is possible for a user normally on the West Coast to be traveling to the East Coast and thus be serviced by the East Coast data center, it is unlikely that writes for this user will occur simultaneously in both data centers at the same time. This is a very good use case for the active-active topology.
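The loop-prevention rule for active-active can be sketched as follows. This is an illustrative simulation with invented names: writes applied by XDR are tagged, and tagged writes are never re-shipped, so an update does not bounce between clusters forever:

```python
# Illustrative sketch of active-active loop prevention: XDR-applied
# writes are marked so they are not forwarded back to their origin.

def apply_write(cluster, key, value, via_xdr=False):
    cluster["data"][key] = value
    if not via_xdr:                       # only client writes are shipped
        cluster["to_ship"].append((key, value))

def ship(src, dst):
    while src["to_ship"]:
        key, value = src["to_ship"].pop(0)
        apply_write(dst, key, value, via_xdr=True)   # tagged: not re-shipped

east = {"data": {}, "to_ship": []}
west = {"data": {}, "to_ship": []}
apply_write(west, "user42", "v1")        # client write on the West cluster
ship(west, east)
assert east["data"] == {"user42": "v1"}
assert east["to_ship"] == []             # nothing bounces back to the West
```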
Star Replication
XDR can ship data to multiple destination clusters. You can achieve this by simply specifying multiple destination clusters in the namespace configuration. Each destination cluster is specified in a separate line with the config parameter xdr-remote-datacenter. This is most commonly used when data will be centrally published and replicated in multiple locations for low-latency read access from local systems.
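For illustration, a star configuration might look like the fragment below, using the xdr-remote-datacenter parameter named above. The namespace name is hypothetical and the exact stanza layout varies by server version, so treat this as a sketch rather than a copy-paste configuration:

```
namespace users {
    enable-xdr true
    xdr-remote-datacenter DC2    # ship this namespace to DC2 ...
    xdr-remote-datacenter DC3    # ... and to DC3 simultaneously (star)
}
```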
Failure Handling
XDR will manage different kinds of failures:
Local node failure
Remote link failure
Combinations of the above
Local node failure
In the case of a local node failure the replica nodes are used to synchronize data with the remote cluster. The Aerospike cluster maintains the state of each node and the data for which it is primary and replica. If a node fails, the cluster will automatically continue the XDR writes using the replica data on the other nodes.
Remote link failure
If the connection between the local and remote clusters goes down, XDR keeps track of the time span(s) for which the link was unavailable; when the link becomes available again, XDR ships the data for those spans. Within a cluster, XDR tracks the last successful ship time by any node.
When XDR is configured to use the star topology, a cluster can ship to multiple data centers simultaneously, and one or more data center links can fail. XDR continues to ship to the available data centers and remembers the periods during which each configured link was down. When a link becomes available again, XDR ships current writes as well as the writes from the outage period.
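The per-link tracking described above can be sketched as follows. This is an illustrative simulation with invented names, using a per-datacenter "last shipped" watermark against a timestamped write log:

```python
# Illustrative sketch of per-datacenter link tracking in a star setup:
# shipping continues to healthy links, and each down link keeps its old
# watermark so the outage window is replayed on recovery.

def ship(write_log, dc_links, last_shipped, now):
    for dc, link_up in dc_links.items():
        if link_up:
            backlog = [w for t, w in write_log if t > last_shipped[dc]]
            last_shipped[dc] = now           # caught up through "now"
            yield dc, backlog

write_log = [(1, "w1"), (2, "w2"), (3, "w3")]
last_shipped = {"DC2": 0, "DC3": 0}

# DC3's link is down at t=2: only DC2 receives w1 and w2.
shipped = dict(ship(write_log[:2], {"DC2": True, "DC3": False},
                    last_shipped, now=2))
assert shipped == {"DC2": ["w1", "w2"]}

# Link restored at t=3: DC3 gets the outage backlog plus current writes.
shipped = dict(ship(write_log, {"DC2": True, "DC3": True},
                    last_shipped, now=3))
assert shipped == {"DC2": ["w3"], "DC3": ["w1", "w2", "w3"]}
```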
Combinations of failures
XDR also handles combinations of failures, such as a local node going down during a remote link failure, or a link failing while XDR is shipping historical data.
Per namespace shipping
An Aerospike node can have multiple namespaces, and you can configure different namespaces to ship to different remote clusters. The following example demonstrates this: DC1 ships namespaces NS1 and NS2 to DC2, and ships namespace NS3 to DC3. This flexibility has helped many customers set different replication rules for different sets of data.
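For illustration, the DC1 side of that arrangement might be configured as in the fragment below, using the xdr-remote-datacenter parameter; the exact stanza layout varies by server version, so treat this as a sketch:

```
namespace NS1 {
    enable-xdr true
    xdr-remote-datacenter DC2
}
namespace NS2 {
    enable-xdr true
    xdr-remote-datacenter DC2
}
namespace NS3 {
    enable-xdr true
    xdr-remote-datacenter DC3
}
```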
You might notice in the above example that writes to namespace NS1 could keep looping around clusters arranged in a ring. To avoid this, XDR has an option to stop forwarding writes that arrive from another XDR.
Finer shipping control with sets
XDR can be configured to ship only certain sets (analogous to tables) to a data center, allowing even finer control over which data is shipped: the combination of namespace and set determines whether a record is shipped. This is often used because not all data in a local cluster's namespace needs to be replicated to other clusters.
Compression
XDR has an option to compress data before shipping, which saves bandwidth between data centers. The administrator can configure a minimum record size threshold; only records larger than this minimum are compressed before shipping.
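The threshold rule can be sketched in a few lines. This is an illustrative simulation (function name and zlib choice are invented; the actual compression algorithm and configuration are Aerospike's):

```python
# Illustrative sketch of threshold-based compression: only records larger
# than the configured minimum size are compressed before shipping.

import zlib

def prepare_for_shipping(record_bytes, min_size):
    if len(record_bytes) > min_size:
        return "compressed", zlib.compress(record_bytes)
    return "raw", record_bytes

kind, payload = prepare_for_shipping(b"x" * 1000, min_size=128)
assert kind == "compressed" and len(payload) < 1000
kind, payload = prepare_for_shipping(b"tiny", min_size=128)
assert kind == "raw" and payload == b"tiny"
```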