The Big Picture
Aerospike is a distributed, scalable NoSQL database. It is architected with three key objectives:
To create a flexible, scalable platform that would meet the needs of today’s web-scale applications
To provide the robustness and reliability (i.e., ACID) expected from traditional databases
To provide operational efficiency (minimal manual involvement)
First published in the Proceedings of VLDB (Very Large Databases) in 2011, the Aerospike architecture consists of 3 layers:
The cluster-aware Client Layer includes open source client libraries that implement Aerospike APIs, track nodes and know where data reside in the cluster.
The self-managing Clustering and Data Distribution Layer oversees cluster communications and automates fail-over, replication, cross data center synchronization and intelligent re-balancing and data migration.
The flash-optimized Data Storage Layer reliably stores data in RAM and Flash.
Aerospike scales up and out on commodity servers:
1) Each multi-threaded, multi-core, multi-CPU server is fast.
2) Each server scales up to manage as much as 16 TB of data, with parallel access to up to 20 SSDs.
3) Aerospike scales out linearly as identical servers are added, processing 100 TB+ in parallel across the cluster. (AppNexus ran a 50-node cluster in production, then consolidated to 12 nodes by increasing SSD capacity per node, in preparation for scaling node count and SSD capacity back up.) See http://www.aerospike.com/wp-content/uploads/2013/12/Aerospike-AppNexus-SSD-Case-Study_Final.pdf
Synchronous replication within the cluster provides immediate consistency. The number of replicas can range from 1 to the number of nodes in the cluster; a typical deployment keeps 2 or 3 copies of the data. Asynchronous replication within the cluster is also supported. Across clusters, asynchronous replication is provided by Cross Data Center Replication (XDR): writes are automatically replicated to one or more locations, with fine-grained control, so individual sets or namespaces can be replicated across clusters in master/master/slave complex ring and star topologies.
Row oriented
Aerospike is a row oriented key value store with high throughput and low latency primary key operations.
Fast
Like Redis and Memcache, Aerospike is fast, but it also scales linearly and easily without sharding, so developers can concentrate on business logic rather than the "plumbing". Single-digit-millisecond latencies for primary key operations are routinely achieved across the whole key space.
Complex types
Like Redis and MongoDB, Aerospike supports simple types like Integer and String, and it also supports complex types like List and Map. Complex types can nest inside each other to accommodate documents such as JSON.
Aerospike also supports collections called Large Data Types of Ordered List, Map, Stack and Set. Each element in a large data type is stored in a sub record, allowing these types to store millions of elements with the same access latency regardless of the size of the collection. Large Data Types are semantically equivalent to “infinite columns” in Cassandra.
Queries and Aggregations
Queries on secondary indexes provide a mechanism similar to traditional RDBMS – the “where” clause. These are particularly good for low selectivity queries.
The results of a Query can be piped into a stream that is fed into an Aggregation, this is essentially indexed MapReduce and can be done in near real time.
High performance service
The Aerospike Daemon is an optimized, high performance Linux service. It is multi-core/multi-socket aware and NUMA node aware. Because it is written in C, it does not suffer service interruptions caused by garbage collection.
Client Layer
The Aerospike “smart client” is designed for speed. It is implemented as an open source linkable library available in C, C#, Java, PHP, Go, node.js and Python, and developers are free to contribute new clients or modify them as needed. The Client Layer has the following functions:
Implements the Aerospike API and the client-server protocol, and talks directly to the cluster.
Tracks nodes and knows where data is stored, instantly learning of changes to cluster configuration or when nodes go up or down.
Implements its own TCP/IP connection pool for efficiency. Also detects transaction failures that have not risen to the level of node failures in the cluster and re-routes those transactions to nodes with copies of the data.
Transparently sends requests directly to the node with the data and re-tries or re-routes requests as needed. One example is during cluster re-configurations.
This architecture reduces transaction latency, offloads work from the cluster and eliminates work for the developer. It also ensures that applications do not have to be restarted when nodes are brought up or down. Finally, it eliminates the need to setup and manage additional cluster management servers or proxies.
From Key to Node
Aerospike has a shared nothing architecture, all nodes in a cluster are identical and there is no single point of failure. The cluster scales elastically and linearly as nodes are added or removed.
Data is distributed evenly across nodes using RIPEMD-160.
The distribution algorithm minimizes data movement when nodes come or go and the cluster must be re-balanced. A proprietary Paxos-based consensus algorithm auto-discovers and efficiently manages cluster formation.
Reads and writes continue non-stop during cluster re-organization.
The self-healing architecture includes auto-discovery of cluster changes, auto-failover, auto-rebalancing, background backups/restores and rolling upgrades, ensuring zero-touch, zero-downtime operations.
The only manual intervention required is for physical changes to the cluster, e.g. adding or upgrading hardware.
Cluster Layer
The Aerospike “shared nothing” architecture is designed to reliably store terabytes of data with automatic fail-over, replication and cross data-center synchronization. This layer scales linearly and implements many of the ACID guarantees. It is also designed to eliminate manual operations through the systematic automation of all cluster management functions. It includes 3 modules:
The Cluster Management Module tracks nodes in the cluster. The key algorithm is a Paxos-like consensus voting process which determines which nodes are considered part of the cluster. Aerospike implements a special heartbeat (active and passive) to monitor inter-node connectivity.
When a node is added or removed and cluster membership is ascertained, each node uses a distributed hash algorithm to divide the primary index space into data 'slices' and assign owners. The Data Migration Module then intelligently balances the distribution of data across the nodes of the cluster, ensuring that each piece of data is duplicated across nodes and across data centers, as specified by the system's configured replication factor.
Division is purely algorithmic: the system scales without a master, eliminating the additional configuration that a sharded environment requires.
The Transaction Processing Module reads and writes data as requested and provides many of the consistency and isolation guarantees. This module is responsible for:
Sync/Async Replication: for writes with immediate consistency, it propagates changes to all replicas before committing the data and returning the result to the client.
Proxy: in rare cases during cluster re-configurations, when the Client Layer may be briefly out of date, it transparently proxies the request to another node.
Duplicate Resolution: when a cluster is recovering from being partitioned, it resolves any conflicts that may have occurred between different copies of data. Resolution can be configured to be
Automatic, in which case the data with the latest timestamp is canonical
User driven, in which case both copies of the data can be returned to the application for resolution at that higher level.
Once you have the first cluster up, you can optionally install additional clusters in other data centers and setup cross data-center replication – this ensures that if your data center goes down, the remote cluster can take over the workload with minimal or no interruption to users.
Data Distribution
Aerospike Database has a Shared-Nothing architecture: every node in an Aerospike cluster is identical, all nodes are peers and there is no single point of failure.
Data is distributed evenly across the nodes in a cluster using the Aerospike Smart Partitions™ algorithm. The algorithm relies on a hash function with very uniform output, which keeps the partition distribution even to within a 1-2% margin.
To determine where a record should go, the record key (of any size) is hashed into a 20-byte fixed-length digest using RIPEMD-160, and the first 12 bits form a partition ID which determines which of the 4096 partitions should contain this record. The partitions are distributed equally among the nodes in the cluster, so if there are N nodes in the cluster, each node stores approximately 1/N of the data.
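The key-to-partition mapping described above can be sketched in a few lines of Python. This is an illustrative sketch, not the server's code: exactly which 12 bits of the digest Aerospike uses is an internal detail, and the SHA-1 fallback exists only because some Python/OpenSSL builds do not expose RIPEMD-160.

```python
import hashlib

PARTITIONS = 4096  # the fixed partition count described above

def record_partition(key: bytes) -> int:
    """Map a record key of any size to one of 4096 partitions."""
    try:
        # Hash the key into a fixed 20-byte digest with RIPEMD-160.
        digest = hashlib.new("ripemd160", key).digest()
    except ValueError:
        # Some OpenSSL builds lack RIPEMD-160; SHA-1 (also 20 bytes)
        # keeps the sketch runnable and illustrates the same idea.
        digest = hashlib.sha1(key).digest()
    # Take the first 12 bits of the digest as the partition ID (0..4095).
    return int.from_bytes(digest[:2], "big") >> 4

pid = record_partition(b"user:1234")
assert 0 <= pid < PARTITIONS
```

Because the digest is effectively uniform, keys land on partitions evenly regardless of the shape of the original keys, which is what makes the 1/N-per-node distribution work.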
Because data is distributed evenly (and randomly) across nodes, there are no hot spots or bottlenecks where one node handles significantly more requests than another node. The random assignment of data ensures that server load is balanced.
For reliability, Aerospike replicates partitions on one or more nodes. One node becomes the data master for reads and writes for a partition and other nodes store replicas.
Synchronous replication ensures immediate consistency and no loss of data. A write transaction is propagated to all replicas before the data is committed and the result returned to the client. In rare cases during cluster re-configurations, when the Aerospike Smart Client™ may have sent the request to the wrong node because it is briefly out of date, the Aerospike Smart Cluster™ transparently proxies the request to the right node. Finally, when a cluster is recovering from being partitioned, it resolves any conflicts that may have occurred between different copies of data. Resolution can be configured to be automatic, in which case the data with the latest timestamp is canonical, or both copies of the data can be returned to the application for resolution at that higher level.
Immediate consistency – by default
The master node for a record coordinates with the replica node(s) before committing the write to RAM. The writes are then asynchronously flushed to storage on the master and replica nodes. As an option, administrators can configure the master to write to the replica asynchronously, but this is a very uncommon case.
Aerospike implements efficient row-level locking (latches) for reads and writes. By default, in the most common deployment mode, Aerospike implements synchronous replication (immediate consistency) across multiple copies of the same data records within a single cluster. The Aerospike partitioning and clustering algorithms use automatic locking and latching to ensure consistency of the data across shards, partitions and databases.
A node is added to the cluster
When a node is added to the cluster, to provide increased capacity or throughput, the existing nodes use the heartbeat to locate the new node. The cluster then begins the process of data migration so that the new node becomes data master for some partitions and replica for others. The client also detects these changes automatically, and future requests are sent to the correct node.
What happens when a node is removed?
If a node has a hardware failure, or is being upgraded, the inverse happens. In a 4-node cluster, one node is the data master for about 1/4 of the data; when it leaves, those partitions exist only as replicas on the remaining nodes. The remaining nodes automatically perform data migration to copy the replica partitions to new data masters.
At the same time, your application, through the Smart Client™, becomes aware that the node has failed, and the client automatically calculates the new partition map.
Automatic rebalancing
The data rebalancing mechanism ensures that query volume is distributed evenly across all nodes, and is robust even if a node fails during rebalancing itself. The system is designed to be continuously available, so data rebalancing doesn't impact cluster behavior. The transaction algorithms are integrated with the data distribution system, and there is only one consensus vote to coordinate a cluster change. With only one vote, there is only a short period during which the cluster's internal redirection mechanisms are used while clients discover the new cluster configuration. Thus, this mechanism optimizes transactional simplicity in a scalable shared-nothing environment while maintaining ACID characteristics.
By not requiring operator intervention, the cluster will self-heal even at demanding times. The capacity planning and system monitoring provide you the ability to handle virtually any unforeseen failure with no loss of service. You can configure and provision your hardware capacity and set up replication/synchronization policies so that the database recovers from failures without affecting users.
Data migration can be tuned to meet your operational requirements; see http://www.aerospike.com/docs/operations/tune/migration/
Clustering
Aerospike uses distributed processing to ensure data reliability. In theory, many databases could get all of the required throughput from a single server with a huge SSD. However, this would not provide any redundancy, and if the server went down, all database access would stop. So a more typical configuration is several nodes, each with several SSD devices. Aerospike typically runs on fewer servers than other databases.
The distributed, shared-nothing architecture means that nodes can self-manage and coordinate to ensure that there are no outages to the user, while at the same time being easy to expand your cluster as traffic increases.
Heartbeat
The nodes in the cluster keep track of each other through a heartbeat so that they can coordinate among themselves. The nodes are peers: no single node is the master, and every node tracks the other nodes in the cluster. When nodes are added or removed, the other nodes in the cluster detect it via the heartbeat mechanism. Aerospike has two ways of defining a cluster:
Multicast: a multicast IP:PORT is used to broadcast heartbeat messages.
Mesh: the address of one Aerospike server is used to join the cluster.
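The two modes correspond to the heartbeat stanza in the Aerospike configuration file. The sketch below is illustrative only: the addresses, ports and timing values are placeholder assumptions, not recommendations, and directive names can vary between server versions.

```
network {
    heartbeat {
        mode mesh                               # or: multicast
        port 3002
        mesh-seed-address-port 10.0.0.1 3002    # address of one node used to join the cluster
        interval 150                            # ms between heartbeats
        timeout 10                              # missed heartbeats before a node is considered gone
    }
}
```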
High speed distributed consensus
Aerospike's 24/7 reliability guarantee starts with its ability to detect cluster failure and quickly recover from that situation to reform the cluster.
Once the cluster change is discovered, all surviving nodes use a Paxos-like consensus voting process to determine which nodes form the cluster and Aerospike Smart Partitions™ algorithm to automatically re-allocate partitions and re-balance. The hashing algorithm is deterministic – that is, it always maps a given record to the same partition. Data records stay in the same partition for their entire life although partitions may move from one server to another.
All of the nodes in the Aerospike system participate in a Paxos distributed consensus algorithm, which is used to ensure agreement on a minimal amount of critical shared state. The most critical part of this shared state is the list of nodes that are participating in the cluster. Consequently, every time a node arrives or departs, the consensus algorithm runs to ensure that agreement is reached.
This process takes a fraction of a second. After consensus is achieved each individual node agrees on both the participants and their order within the cluster. Using this information the master node for any transaction can be computed along with the replica nodes.
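The computation in the last step can be sketched as follows. The rotation rule used here is an illustrative stand-in for Aerospike's actual assignment function; the point is that, given the agreed node order, every node derives the same master and replicas for a partition with no further messaging.

```python
def succession(nodes: list, pid: int, replication_factor: int = 2) -> list:
    """Return [master, replica, ...] for a partition, given the agreed
    node ordering. Every node computes the same answer locally."""
    start = pid % len(nodes)                 # deterministic starting point
    rotated = nodes[start:] + nodes[:start]  # rotate the agreed order
    return rotated[:replication_factor]

nodes = ["A", "B", "C", "D"]                 # the post-consensus ordering
master, replica = succession(nodes, pid=4093)
```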
Management and Monitoring
Aerospike provides a sophisticated management console, the Aerospike Management Console (AMC). Through AMC you can manage and monitor every aspect of your cluster. Plugins are also available for Graphite and Nagios. You can also "roll your own" tooling with the comprehensive "Info" APIs.
Handling Traffic Saturation
A detailed discussion of the network hardware required to handle peak traffic loads is beyond the scope of this course. Aerospike Database provides you with monitoring tools that let you identify bottlenecks. If the network is the bottleneck, the database will not run at capacity and requests will be slow.
Handling Capacity Overflows
We have many suggestions for capacity planning, managing your storage and monitoring the cluster to ensure that your storage doesn't overflow. If it does, Aerospike reaches its stop-writes limit: no new records are accepted, but data updates and reads continue to be processed.
In other words, even beyond optimal capacity, the database does not stop handling requests, it continues to do as much as possible to keep processing user requests.
Primary Index
In an operational database, the fastest and most predictable index is the primary key index. This index provides the most predictable and fastest access to row information in a database.
Aerospike's primary key index is a blend of distributed hash table technology with a distributed tree structure in each server. The entire keyspace in a set (table) is partitioned using a robust hash function into partitions. There are 4096 partitions in total, equally distributed across the nodes in the cluster. Please see data-distribution for details of hashing and partitioning.
At the lowest level, an in-memory red-black tree structure is used, similar to the data structures used in the popular Memcache system.
The primary index is on the 20-byte hash (or digest) of the specified primary key. While this expands the key size of some records (which might have, for example, a unique 8-byte key), it is beneficial because the code behaves predictably regardless of the input key size or distribution.
When a single server fails, the indexes on a second server are immediately available. If the failed server remains down, data starts rebalancing. As the data rebalances, the index is computed on the new server as the data arrives.
Index Metadata
Each index entry currently requires 64 bytes. Along with the digest, the following metadata is also stored in the index:
Write generation (or vector clock): tracks all the updates to the key in the system, and is used when resolving conflicting updates.
Time to live: tracks the life of a key in the system. This is the time at which the key should expire, and is used by the eviction subsystem.
Storage address: the location of the data in storage (both in-memory and persistent).
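As a sketch, a 64-byte index entry can be pictured as the following record. The field names and Python representation are illustrative; the server packs these fields into a C structure, not shown in this document.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    digest: bytes         # 20-byte RIPEMD-160 digest of the primary key
    generation: int       # write generation / vector clock, for conflict resolution
    ttl: int              # time at which the key expires; used by the evictor
    storage_address: int  # where the record lives (RAM offset or device block)

entry = IndexEntry(digest=b"\x00" * 20, generation=3, ttl=1700000000,
                   storage_address=0x4000)
```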
Index Persistence
To maintain throughput, primary indexes are not committed to storage - only to RAM. This allows writes to be very high performance, and highly parallel. The data storage layer may also be configured not to use storage.
When the Aerospike server starts, it walks through the data on storage and creates the primary index for all of the data partitions.
Fast Restart
In order to support fast upgrades of clusters with no downtime, Aerospike supports a feature called "fast restart". The index entries are allocated from slabs of memory. When "fast restart" is configured, this index memory is allocated from a Linux shared memory segment. In the case of a planned shutdown and restart of Aerospike (e.g. upgrades), on restart the server simply re-attaches to the shared memory segment and brings the primary indexes live. This does not require any scan of the data in storage.
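The shared memory trick can be demonstrated with Python's multiprocessing.shared_memory. This is only an analogy for the mechanism: the segment name, size and contents are made up, and Aerospike's implementation uses its own allocator on top of Linux shared memory.

```python
from multiprocessing import shared_memory

NAME = "as_fast_restart_demo"

# Clean up any stale demo segment from a previous run.
try:
    shared_memory.SharedMemory(name=NAME).unlink()
except FileNotFoundError:
    pass

# "First boot": allocate index memory as a named shared memory segment.
seg = shared_memory.SharedMemory(name=NAME, create=True, size=64)
seg.buf[:4] = b"IDX1"  # stand-in for primary index entries

# "Restarted server": attach by name -- the index bytes are still there,
# so no scan of the data on storage is needed.
seg2 = shared_memory.SharedMemory(name=NAME)
restored = bytes(seg2.buf[:4])

seg2.close()
seg.close()
seg.unlink()  # demo cleanup; a real fast restart would keep the segment
```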
Single bin integer optimization
Storing a key-value entry where the value is a single integer can be very useful. When the "single bin" feature is enabled on a namespace (providing lower memory use than supporting the full Aerospike bin structure on each record), a further optimization is available. When the stored value is an integer, the space which would point to the bin information is reused to hold the integer itself. For this special case, the amount of storage required for integer values is simply the space required for the index.
Secondary Index
Secondary indexes are on non-primary-key values, which provides the ability to model one-to-many relationships. Indexes are specified on a bin-by-bin basis (like columns in an RDBMS). This allows efficient updates and minimizes the amount of resources required to store the indexes.
A data description language (DDL) is used to determine which bins, and of which type, are to be indexed. Indexes can be created and removed dynamically through the provided tools or the API. This DDL, which appears similar to an RDBMS schema, is NOT used for data validation: even if a bin is listed in the DDL as indexed, that bin is optional. For an indexed bin, updating the record to include the bin updates the index.
For example, an index can be created only on strings OR integers. Consider the case where you have a bin for storing a user's age and the age value is stored as a string by one application and an integer by another application. An integer index will exclude records which contain string values stored in the indexed bin while a string index would exclude records with integer values stored in the indexed bin. Secondary indexes are:
Stored in RAM for fast look-up
Built on every node in the cluster and collocated with the primary index. Each secondary index entry contains references to records local to the node only.
Secondary indexes contain pointers to both master records and replicated records in a cluster.
Data Structure
Like the Aerospike primary index, the secondary index is a blend of a hash table and B-trees: structurally, a hash table of B-trees whose entries in turn point to B-trees of record references. Each logical secondary index has 32 physical trees. When a key is identified for an index operation, a hash function determines which of the 32 physical trees should hold the secondary index entry, and the update is then made into that tree. Note that a single secondary index key can reference many primary records, so the lowest level of this structure is a list of full-fledged B-trees, sized according to the number of records that need to be stored.
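A toy version of this layout, with ordered lists standing in for B-trees, may make the shape clearer. The tree count (32) comes from the text; everything else is illustrative.

```python
import bisect

N_TREES = 32
trees = [[] for _ in range(N_TREES)]  # each list stands in for a physical B-tree

def index_insert(index_key, record_digest):
    # A hash of the indexed value picks 1 of the 32 physical trees...
    tree = trees[hash(index_key) % N_TREES]
    # ...and the entry is kept in sorted order, as a B-tree would.
    bisect.insort(tree, (index_key, record_digest))

index_insert("age=31", "digest-1")
index_insert("age=31", "digest-2")  # one index key can reference many records
```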
Index Creation
Aerospike supports dynamic creation of secondary indexes. Tools such as aql are available which read the indexes currently available, and allow creation and destruction of indexes.
To build a secondary index, users need to specify a namespace, set, bin and type of the index (e.g., integer, string, etc.).
On receiving confirmation from the System Metadata (SMD) subsystem, each node creates the secondary index in write-active mode and starts a background scan job, which scans all of the data and inserts entries into the secondary index.
Index entries are only created for records that match all of the index specifications.
The scan job that populates the secondary index interacts with read/write transactions in exactly the same way a normal scan does, except that there is no network component to the index creation scan, unlike a normal scan. During the index creation period, all new writes that affect the indexing attribute will update the index.
As soon as the index creation scan is completed and all of the index entries are created, the index is immediately ready for use by queries and is marked read-active.
The secondary index is available for normal querying once it has been created successfully on all of the nodes.
User Defined Function
In SQL databases, a user-defined function provides a mechanism for extending the functionality of the database server by adding a function that can be evaluated in SQL statements.
UDFs allow compute to move close to the data: the function is run on the same server where the data resides.
Data Storage Layer
Aerospike is a key-value store with schema-less data model. Data is organized into policy containers called namespaces, semantically similar to databases in an RDBMS system. Within a namespace, data is subdivided into sets (similar to tables) and records (similar to rows). Each record has an indexed key that is unique in the set, and one or more named bins (similar to columns) that hold values associated with the record.
Sets and bins do not need to be defined up front, but can be added during run-time for maximum flexibility.
Values in bins are strongly typed, and can include any of the supported data types. Bins themselves are not typed, so different records could have the same bin with values of different types.
Indexes (primary keys and secondary keys) are stored in RAM for ultra-fast access, and values can be stored either in RAM or, more cost-effectively, on SSDs. Each namespace can be configured separately, so small namespaces can take advantage of DRAM while larger ones gain the cost benefits of SSDs.
The Data Layer was particularly designed for speed and a dramatic reduction in hardware costs. It can operate all in-memory, eliminating the need for a caching layer or it can take advantage of unique optimizations for flash storage. In either case, data is never lost.
100 Million keys take up only 6.4GB. Although keys have no size limitations, each key is efficiently stored in just 64 bytes.
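The 6.4 GB figure follows directly from the 64-byte index entry size:

```python
entries = 100_000_000        # 100 million keys
bytes_per_entry = 64         # fixed primary index entry size
total_gb = entries * bytes_per_entry / 1e9
assert total_gb == 6.4       # 100M keys -> 6.4 GB of index RAM
```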
Native, multi-threaded, multi-core Flash I/O and an Aerospike log structured file system take advantage of low level SSD read and write patterns. In addition, writes to disk are performed in large blocks to minimize latency. This mechanism bypasses the standard file system, historically tuned to rotational disks.
Also built in are a Smart Defragmenter and an Intelligent Evictor. These processes work together to ensure that there is space in RAM, and that data is safely written to disk and never lost.
The Defragmenter tracks the number of active records in each block and reclaims blocks that fall below a minimum level of use.
The Evictor removes expired records and reclaims memory if the system gets beyond a set high-water mark. Expiration times are configured per namespace; the age of a record is calculated from the last time it was modified, and the application can override the default lifetime at any time and specify that a record should never be evicted.
Hybrid Storage
The Hybrid Memory System holds the indexes and data stored in each node, and handles interactions with the physical storage. It also contains modules that automatically remove old data from the database, and defragment the physical storage to optimize disk usage.
Aerospike can store data in RAM or on Flash (SSD), and each namespace can be configured separately. This configuration flexibility allows an application developer to put a small, frequently accessed namespace in RAM, but put a larger namespace in less expensive storage such as SSDs.
Significant work has been done to optimize data storage on SSDs, including bypassing the file system to take advantage of low-level SSD read and write patterns.
Philosophy
Other than Large Data Types, all the data for a record is stored together. The amount of storage for each row is limited by default to 1 MB in size, which can be increased.
Storage is copy-on-write, with free space being reclaimed by the defragmentation process.
Each namespace is configured with a fixed amount of storage. Every node must have the same namespaces, and the same amount of storage for each namespace.
Storage can be configured with pure RAM without persistence, RAM with storage persistence, or using Flash storage (SSDs).
Eviction based on storage
The Defragmenter tracks the number of active records on each block on disk, and reclaims blocks that fall below a minimum level of use. The Evictor is responsible for removing references to expired records, and for reclaiming memory if the system gets beyond a set high water mark. When configuring a namespace, the administrator specifies the maximum amount of RAM used for that namespace, as well as the default lifetime for data in the namespace. Under normal operation, the Evictor looks for data that has expired, freeing the index in memory and releasing the record on disk. The Evictor also tracks the memory used by the namespace, and releases older, although not necessarily expired, records if the memory exceeds the configured high water mark. By allowing the Evictor to remove old data when the system hits its memory limitations, Aerospike can effectively be used as an LRU cache. Note that the age of a record is measured from the last time it was modified, and that the Application can override the default lifetime any time it writes data to the record. The Application may also tell the system that a particular record should never be automatically evicted.
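The evictor's policy can be sketched as follows. The record layout, field names and thresholds are illustrative; only the policy itself (expired records always go, and the oldest records go next once the high-water mark is exceeded, unless flagged never-evict) comes from the text above.

```python
def run_evictor(records, used_bytes, high_water_bytes, now):
    """Return (surviving records, resulting memory use).

    records: dicts with 'ttl' (expiry time or None), 'modified' (last
    write time), 'size' (bytes) and 'never_evict' (application flag).
    """
    survivors = []
    for rec in sorted(records, key=lambda r: r["modified"]):  # oldest first
        if rec["ttl"] is not None and rec["ttl"] <= now:
            used_bytes -= rec["size"]   # expired: always removed
        elif used_bytes > high_water_bytes and not rec["never_evict"]:
            used_bytes -= rec["size"]   # over the high-water mark: evict oldest
        else:
            survivors.append(rec)
    return survivors, used_bytes

recs = [
    {"ttl": 50,   "modified": 1, "size": 10, "never_evict": False},  # expired
    {"ttl": None, "modified": 2, "size": 10, "never_evict": False},  # old
    {"ttl": None, "modified": 3, "size": 10, "never_evict": False},  # recent
]
left, used = run_evictor(recs, used_bytes=30, high_water_bytes=15, now=100)
```

Letting the evictor release the oldest not-yet-expired records once memory is tight is exactly what lets Aerospike act as an LRU cache.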
Data in RAM
Data purely in RAM - without persistence - has the benefits of higher throughput. Although modern Flash storage is very high performance, RAM is even higher performance, and prices of RAM are dropping.
Data is allocated through the JEMalloc allocator. JEMalloc allows allocation into different pools. Long term allocations - such as those for the storage layer - can be allocated separately. We have found the JEMalloc allocator has exceptional properties in terms of low fragmentation.
By using multiple copies of RAM, very high levels of reliability can be obtained. Since Aerospike automatically re-shards and replicates data on failure or cluster node addition, a high level of "k-safety" can be obtained. Bringing a node back online automatically populates its data from one of the copies.
Due to Aerospike's random data distribution, data unavailability when several nodes are lost can be quite small. For example, in a 10-node cluster with two copies of the data, if two nodes are lost in quick succession, the amount of data unavailable before re-replication completes is only about 2% (roughly 1/50 of the data).
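The arithmetic behind this example, under the simplifying assumption that the two copies of each partition are placed independently and uniformly:

```python
nodes = 10

# A partition becomes unavailable only if BOTH of its copies were on the
# two failed nodes: the master on one of them (2/10), and the replica on
# the other, out of the remaining 9 nodes (1/9).
unavailable = (2 / nodes) * (1 / (nodes - 1))
# ~2.2%, in line with the rough "about 2%" figure in the text
```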
When a persistence layer is configured, reads always occur from the RAM copy. Writes occur through the data path as described below.
Data on SSD / Flash
When data is to be written, a write latch is taken for the row, to avoid two conflicting writes on the same record. In some cluster states, data may need to be read from other nodes and conflicts resolved.
After the write is validated, the in-memory representations of the record are updated on the master. The data that will be written is added to a write buffer. If the write buffer is full, it is queued to disk. Depending on the size of the write buffer - which is the same as the maximum row size - and the throughput of writes, there will be a certain risk of uncommitted data.
If there are replicas, they are updated, and their in-memory indexes are updated as well. After data has been updated on all copies in-memory, the result is returned to the client.
The system can be configured to return a result to the client before all writes complete (delayed consistency).
Flash optimizations
The Defragmenter tracks the number of active records on each block on disk, continually scanning the available blocks and reclaiming any block whose level of use falls below a minimum threshold (that is, any block which is more than a certain amount free).
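The bookkeeping involved can be sketched as follows. The 50% threshold and block layout are illustrative placeholders; in practice the defragmentation threshold and rate are configurable.

```python
def find_defrag_candidates(blocks, low_water_mark=0.5):
    """blocks maps block_id -> (live_records, total_records).
    Returns the blocks whose live fraction fell below the threshold;
    their live records get rewritten into the current write block and
    the block is then reclaimed."""
    return [block_id
            for block_id, (live, total) in blocks.items()
            if total > 0 and live / total < low_water_mark]

blocks = {0: (10, 100), 1: (80, 100), 2: (49, 100)}
candidates = find_defrag_candidates(blocks)  # blocks 0 and 2 are mostly dead
```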
Wear Leveling
Writing to flash in large, even blocks spreads write activity across the device, which works with (rather than against) the SSD's internal wear-leveling and reduces write amplification.
Block size
Writes are buffered and flushed to the device in large blocks rather than as small random writes, matching the write characteristics of flash.
Copy on write
Records are never updated in place: each update is written to a new location, and the old copy is reclaimed later by defragmentation.
Continuous background defrag
Defragmentation runs continuously in the background, so free space is reclaimed steadily without introducing latency spikes.
Even distribution across devices
No RAID
Reading a record
The entire record is read from storage into the server's RAM; only the requested bins are returned to the client.
The process:
Client finds Master Node from partition map.
Client makes read request to Master Node.
Master Node finds data location from index in RAM.
Master Node reads entire record from flash. (This is true even if only a single bin is being read.)
Master Node returns value.
Writing a new record
Entries are added into the primary and any secondary indexes. The record is placed in a write buffer to be written to the next available block.
The process:
Client finds Master Node from partition map.
Client makes write request to Master Node.
Master Node makes an entry in the index (in RAM) and queues the write in a temporary write buffer.
Master Node coordinates write with replica nodes (not shown).
Master Node returns success to client.
Master Node asynchronously writes data in blocks.
Index in RAM points to location on flash.
Updating a record
The record is read into server RAM, updated as needed, then written to a new block (copy on write).
The index(es) are updated to point to the new location. The storage space of the old record is recovered at some future time by the automatic background defragmentation.
The process:
Client finds Master Node from partition map.
Client makes update request to Master Node.
Master Node reads the existing record.
Master Node queues the write of the updated record in a temporary write buffer.
Master Node coordinates write with replica nodes (not shown).
Master Node returns success to client.
Master Node asynchronously writes data in blocks.
Index in RAM points to new location on flash.
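The copy-on-write update can be sketched as below. This is an illustrative simulation with invented names, showing why the old copy lingers on flash until defragmentation reclaims it:

```python
# Illustrative sketch of copy-on-write updates: the modified record is
# written to a NEW location, the index is repointed, and the old block's
# live-byte count drops so defragmentation can reclaim it later.

def update(index, flash, live_bytes, key, changes):
    old_loc = index[key]
    record = dict(flash[old_loc])        # read existing record into RAM
    record.update(changes)               # apply the changed bins
    new_loc = max(flash) + 1             # next free location (simplified)
    flash[new_loc] = record              # write to a new block, no overwrite
    index[key] = new_loc                 # index now points at the new copy
    live_bytes[old_loc] = 0              # old copy is dead; defrag reclaims

index = {"k": 0}
flash = {0: {"count": 1}}
live_bytes = {0: 64}
update(index, flash, live_bytes, "k", {"count": 2})
assert index["k"] == 1
assert flash[1] == {"count": 2}
assert flash[0] == {"count": 1}      # stale copy still on flash for now
assert live_bytes[0] == 0            # eligible for defragmentation
```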
Deleting a record in Flash
Deleting a record removes the entries from the indexes only, making deletes very fast.
The background defragmentation will physically delete the record at a future time.
The process:
Client finds Master Node from partition map.
Client makes delete request to Master Node.
Master Node removes the entry from the index(es).
Master Node coordinates deletion with replica nodes (not shown).
Master Node returns success to client.
Data is eventually deleted from flash by defragmentation.
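The index-only delete can be sketched in a few lines. An illustrative simulation with invented names:

```python
# Illustrative sketch: a delete only drops the index entry; the record's
# bytes stay on flash until background defragmentation reclaims the block.

def delete(index, key):
    del index[key]        # metadata-only operation; no flash I/O

index = {"k": 7}
flash = {7: {"name": "Ada"}}
delete(index, "k")
assert "k" not in index        # record is no longer reachable
assert 7 in flash              # bytes persist until defrag runs
```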
XDR Architecture
XDR - Cross Datacenter Replication - is an Aerospike feature which allows one cluster to synchronize with another cluster over a higher-latency link. This replication is asynchronous, but delay times are often under a second. Each write is logged, and that log is used to replicate to remote clusters. Each master for a partition is responsible for replicating that partition's writes, and the current write status is replicated to the local cluster's partition replica, so when that replica is promoted, it can take over where the previous master left off.
An XDR process runs on every node of the cluster alongside the Aerospike process. The main task of the XDR process is to monitor database updates on the "local" cluster and send copies of all write requests to one or more "remote" destination clusters. XDR determines the remote cluster details from the main Aerospike configuration files. It is possible to ship different namespaces (or even sets) to different clusters. In addition, the XDR process takes care of failure handling. Failures can be local node failures, link failures to the remote cluster, or both.
Each node runs:
Aerospike daemon process (asd)
XDR daemon process (xdr), which is composed of the following:
Logger module
Shipper module
Failure Handler module
An XDR process runs on every node of the cluster. All the writes/updates that happen on each node in the local cluster need to be forwarded to one or more remote clusters. In order to do this efficiently, XDR uses a log of all digests of the keys that need to be shipped. When XDR writes to the remote cluster, it will get the current value of the key at that time. This is done so that if there are multiple changes to the same record, only the latest value is sent; intermediary values are not. XDR does not store any values of the record in the digest log.
In detail: writes and updates by the main database process are communicated to the XDR process over a named pipe. The Logger module receives this communication and stores it in the digest log. To keep the digest log small, only minimal information (the key digest) is stored - just enough to ship the actual records at a later time. The digest log is a ring buffer file: if the buffer is full, the most recent data overwrites the oldest. This bounds the file size, but it also introduces a capacity planning issue. You must configure a digest log large enough to avoid losing any pending data that has yet to be shipped, and you should plan for long-term outages in the links between your data centers - for instance, a data center going down due to weather or an earthquake.
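The digest log's behavior can be sketched as a ring buffer. This is an illustrative simulation (the real log stores key digests, which the sketch stands in for with the key itself); it shows why only the latest value of a repeatedly updated record is shipped, and why a full buffer overwrites the oldest pending entries:

```python
# Illustrative sketch of the digest log: a fixed-size ring buffer of key
# digests (no values). At ship time XDR reads the record's CURRENT value,
# so several updates to one key collapse into one shipment. If writes
# outpace shipping, the oldest unshipped digests are overwritten - the
# capacity-planning risk described above.

class DigestLog:
    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0                    # next slot to write

    def log(self, digest):
        self.buf[self.head] = digest     # may overwrite the oldest entry
        self.head = (self.head + 1) % len(self.buf)

    def ship(self, local_db, remote_db):
        pending = {d for d in self.buf if d is not None}  # dedupe digests
        for d in pending:
            if d in local_db:
                remote_db[d] = local_db[d]   # ship the latest value only

log = DigestLog(slots=4)
local, remote = {}, {}
for value in (1, 2, 3):                  # three updates to the same key
    local["key-a"] = value
    log.log("key-a")                     # key used as its own "digest"
log.ship(local, remote)
assert remote == {"key-a": 3}            # intermediate values never shipped
```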
XDR Topologies
XDR can be configured with various topologies such as simple active-active, active-passive, star, and combinations of these. The star topology allows one data center to simultaneously replicate data to multiple data centers. This complements Aerospike’s support for complex multi-master ring replication topologies, giving companies unprecedented flexibility to implement the replication approach that best supports their needs.
Simple Active-Passive
In the active-passive topology, writes initially happen at only one cluster. The remote cluster is a stand-by cluster that can be used for reads. The system does not have a mechanism to stop application (non-XDR) writes on the remote cluster, so the application should not write any data directly to the destination cluster. XDR shipping on the destination cluster should be disabled using the config option "enable-xdr" so that any inadvertent writes on the remote cluster are not shipped out to other clusters. This topology is useful when intense analysis of data is needed but you do not want to affect performance on the main cluster: the main cluster is configured as active and the analysis cluster as passive.
Simple Active-Active
In the active-active topology, writes can happen at both clusters, and writes on one cluster are forwarded to the other. XDR ensures that shipped data is not forwarded further, avoiding an infinite loop of writes bouncing between clusters. Because writes are allowed on both clusters, the same key can be updated on both at the same time, leading to inconsistent data. As the Aerospike database is agnostic to the application data, it cannot auto-correct this inconsistency. The window in which inconsistency can arise spans from the time of the first write to the time it is shipped and rewritten on the second cluster: a write to the same record on the remote cluster within this window produces an inconsistency the server cannot resolve automatically. A subsequent update will, however, force the data back into sync, assuming the concurrent-write pattern does not repeat. Active-active is therefore unsuitable where the same record may be updated in two clusters at the same time, but it works very well when primary writes to a record are strongly associated with a single cluster and the other clusters act as a hot backup.
For example, a company has users spread across North America. Traffic is divided between the West Coast and East Coast data centers. While it is possible for a user normally on the West Coast to be traveling to the East Coast and thus be serviced by the East Coast data center, it is unlikely that writes for this user will occur simultaneously in both data centers at the same time. This is a very good use case for the active-active topology.
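The loop-prevention rule for active-active can be sketched as follows. This is an illustrative simulation with invented names: writes applied by XDR are tagged, and tagged writes are never re-shipped, so an update does not bounce between clusters forever:

```python
# Illustrative sketch of active-active loop prevention: XDR-applied
# writes are marked so they are not forwarded back to their origin.

def apply_write(cluster, key, value, via_xdr=False):
    cluster["data"][key] = value
    if not via_xdr:                       # only client writes are shipped
        cluster["to_ship"].append((key, value))

def ship(src, dst):
    while src["to_ship"]:
        key, value = src["to_ship"].pop(0)
        apply_write(dst, key, value, via_xdr=True)   # tagged: not re-shipped

east = {"data": {}, "to_ship": []}
west = {"data": {}, "to_ship": []}
apply_write(west, "user42", "v1")        # client write on the West cluster
ship(west, east)
assert east["data"] == {"user42": "v1"}
assert east["to_ship"] == []             # nothing bounces back to the West
```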
Star Replication
XDR can ship data to multiple destination clusters. You can achieve this by simply specifying multiple destination clusters in the namespace configuration. Each destination cluster is specified in a separate line with the config parameter xdr-remote-datacenter. This is most commonly used when data will be centrally published and replicated in multiple locations for low-latency read access from local systems.
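For illustration, a star configuration might look like the fragment below, using the xdr-remote-datacenter parameter named above. The namespace name is hypothetical and the exact stanza layout varies by server version, so treat this as a sketch rather than a copy-paste configuration:

```
namespace users {
    enable-xdr true
    xdr-remote-datacenter DC2    # ship this namespace to DC2 ...
    xdr-remote-datacenter DC3    # ... and to DC3 simultaneously (star)
}
```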
Failure Handling
XDR will manage different kinds of failures:
Local node failure
Remote link failure
Combinations of the above
Local node failure
In the case of a local node failure the replica nodes are used to synchronize data with the remote cluster. The Aerospike cluster maintains the state of each node and the data for which it is primary and replica. If a node fails, the cluster will automatically continue the XDR writes using the replica data on the other nodes.
Remote link failure
If the connection between the local and remote clusters goes down, XDR keeps track of the time span(s) for which the link was unavailable; when the link becomes available again, XDR ships the data for those spans. Within a cluster, XDR tracks the last successful ship time by any node.
When XDR is configured to use the star topology, a cluster can ship to multiple data centers simultaneously, and one or more data center links can fail. XDR continues to ship to the available data centers and remembers the periods during which each configured link was down. When a link becomes available again, XDR ships current writes as well as the writes from the outage period.
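The per-link tracking described above can be sketched as follows. This is an illustrative simulation with invented names, using a per-datacenter "last shipped" watermark against a timestamped write log:

```python
# Illustrative sketch of per-datacenter link tracking in a star setup:
# shipping continues to healthy links, and each down link keeps its old
# watermark so the outage window is replayed on recovery.

def ship(write_log, dc_links, last_shipped, now):
    for dc, link_up in dc_links.items():
        if link_up:
            backlog = [w for t, w in write_log if t > last_shipped[dc]]
            last_shipped[dc] = now           # caught up through "now"
            yield dc, backlog

write_log = [(1, "w1"), (2, "w2"), (3, "w3")]
last_shipped = {"DC2": 0, "DC3": 0}

# DC3's link is down at t=2: only DC2 receives w1 and w2.
shipped = dict(ship(write_log[:2], {"DC2": True, "DC3": False},
                    last_shipped, now=2))
assert shipped == {"DC2": ["w1", "w2"]}

# Link restored at t=3: DC3 gets the outage backlog plus current writes.
shipped = dict(ship(write_log, {"DC2": True, "DC3": True},
                    last_shipped, now=3))
assert shipped == {"DC2": ["w3"], "DC3": ["w1", "w2", "w3"]}
```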
Combinations of failures
XDR also handles combinations of failures, such as a local node going down during a remote link failure, or a link failing while XDR is shipping historical data.
Per namespace shipping
An Aerospike node can have multiple namespaces, and you can configure different namespaces to ship to different remote clusters. The following example demonstrates this: DC1 ships namespaces NS1 and NS2 to DC2, and ships namespace NS3 to DC3. This flexibility has helped many customers set different replication rules for different sets of data.
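For illustration, the DC1 side of that arrangement might be configured as in the fragment below, using the xdr-remote-datacenter parameter; the exact stanza layout varies by server version, so treat this as a sketch:

```
namespace NS1 {
    enable-xdr true
    xdr-remote-datacenter DC2
}
namespace NS2 {
    enable-xdr true
    xdr-remote-datacenter DC2
}
namespace NS3 {
    enable-xdr true
    xdr-remote-datacenter DC3
}
```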
You might notice in the above example that writes to namespace NS1 could keep looping around clusters arranged in a ring. To avoid this, XDR has an option to stop forwarding writes that arrive from another XDR.
Finer shipping control with sets
XDR can be configured to ship only certain sets (analogous to tables) to a data center, allowing even finer control over which data is shipped: the combination of namespace and set determines whether a record is shipped. This is often used because not all data in a local cluster's namespace needs to be replicated to other clusters.
Compression
XDR has an option to compress data before shipping, which saves bandwidth between data centers. The administrator can configure a minimum record size threshold; only records larger than this minimum are compressed before shipping.
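The threshold rule can be sketched in a few lines. This is an illustrative simulation (function name and zlib choice are invented; the actual compression algorithm and configuration are Aerospike's):

```python
# Illustrative sketch of threshold-based compression: only records larger
# than the configured minimum size are compressed before shipping.

import zlib

def prepare_for_shipping(record_bytes, min_size):
    if len(record_bytes) > min_size:
        return "compressed", zlib.compress(record_bytes)
    return "raw", record_bytes

kind, payload = prepare_for_shipping(b"x" * 1000, min_size=128)
assert kind == "compressed" and len(payload) < 1000
kind, payload = prepare_for_shipping(b"tiny", min_size=128)
assert kind == "raw" and payload == b"tiny"
```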