Elliptics

Elliptics: building a distributed, fault-tolerant data storage
Rim Zaydullin
25 September 2017

Safe… or is it unlikely to break?

In the 21st century we have figured out ways to get around disk problems:
RAID, replication, Reed-Solomon coding, LDPC and many others
*Enterprise IBM hard drive, circa 1980:
1.7 or 3.4 GB capacity, price 250 000 USD

What if it is “some master server”?
But what will happen when the server goes down?

What if the whole datacenter goes down?
Should you plan for this?

The probability of these events can be VERY small
What will happen to your business/systems if it happens after all?
“Things always become obvious after the fact”
― Nassim Nicholas Taleb

Reasons for losses of servers, data-centers, coherence
• Tornado, earthquake, flood
• Tech support made a change on the wrong rack
• Errors made by NOCs
• A cat that got into an electrical transformer and burned together with the equipment
• A virtual machine cluster got a new, really angry neighbor
• Cloud provider suddenly went down (say hello, Amazon S3!)
• An excavator tore an underground optical cable while digging a ditch
*all the above examples are from real life

You can fix anything… if you have enough time and money.
And if you have nothing else to do :)

Choosing the data storage system that is right for you.
You need to answer the following questions:
What is your record size: bytes? KB? MB? GB? TB?
Do you need:
- transactions?
- replication?
- fastest access possible?
- query language?
- full-text search?
- CAP properties?
- scalability options?
…

To put it simply:
- Massively scalable - replica sets of DHTs
- Fault tolerant by design
- Fast - async I/O, caching, Eblob, bloom filters
- Ease of use: C, C++, Go, HTTP REST, WebDAV, (S3)
- One point of entry for the clients
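Bloom filters are named above as one of the speed-ups. A minimal sketch of the general technique follows; it is not the Elliptics or Eblob implementation, and the class name, sizes and hashing scheme are assumptions made for illustration. The filter answers "definitely not stored" without touching the disk, at the cost of occasional false positives.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; parameters are arbitrary assumptions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

A lookup would check the filter first and go to the index or disk only when might_contain() returns True.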

Elliptics:
- a very fast, linearly scalable NoSQL (key/value) data storage
- based on DHT principles
- designed to store medium to large data records, from 1 KB up to terabytes
Features:
- No transaction support, but a write to a single replica is atomic
- CAP - Availability, Partition tolerance + Eventual consistency
- No metadata servers, true horizontal scaling
- Replication - geographically distributed replication
- Direct P2P data streaming (useful for large files)
- Access speed - true O(1) data read access + SLRU cache
- Automatic data repartitioning in case of removed or added storage nodes
- Bulk writes
- Datacenter aware (cross datacenter replication) and CDN
- and much more…
Open source (GPL), implemented in C/C++

CAP theorem
Consistency: all clients see the same data at the same time
Availability: will always respond to a request, even if data is not completely consistent
Partition Tolerance: works even in presence of node/network failures
CA: RDBMS (MySQL/MariaDB, Postgres, MSSQL, Oracle), Elastic Search, …
CP: HBASE, MongoDB, Redis, Google Big Table, Ceph, …
PA: Elliptics, Cassandra, Riak, DynamoDB, CouchDB, …

THE GREAT MISCONCEPTION about EVENTUAL CONSISTENCY

The ultimate performance guide:
* parallel write
* async I/O

The ultimate performance guide:
* SLRU
* cache everything
* 2 million RPS from 10 nodes
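SLRU (segmented LRU) splits the cache into a probationary and a protected segment, so a single scan over cold keys does not evict hot ones. The sketch below shows the general technique only; the class name and segment sizes are assumptions, and the actual Elliptics cache may differ.

```python
from collections import OrderedDict


class SLRUCache:
    """Illustrative segmented LRU: new keys land in probation, a second hit promotes."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()
        self.protected = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self.protected[key] = value  # promote on repeated access
            if len(self.protected) > self.protected_size:
                # Demote the least recently used protected entry.
                old_key, old_value = self.protected.popitem(last=False)
                self._insert_probation(old_key, old_value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            self.probation.pop(key, None)
            self._insert_probation(key, value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        self.probation.move_to_end(key)
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict the coldest entry
```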

*O(1) lookup time
*Redirect and CDN
*Data Streaming
*HTTP(S) access

Backrunner - HTTP(S) Go / C / C++ / WebDAV client
[Diagram, top to bottom: Backrunner, 3 buckets, 9 replicas, 27 eblob backends, storage hardware]

Elliptics storage
Clients: Backrunner - HTTP(S) Go/C++/C/WebDAV client
Buckets:
Replica: DHT (Distributed Hash Table)

Elliptics: Fast, Open… and safe.
In production at:
Yandex.Disk
Yandex.Mail
Yandex.Maps
Yandex.Music
Yandex.Photos
http://elliptics.io

Q&A
Rim Zaydullin
Thank you!
zaydullinr@seagroup.com

Additional technical slides

To put it simply:
- Scalable - DHT
- Fault tolerant by design
- Fast - Eblob, async I/O, caching, bloom filters
- Simplicity of usage: C/C++/Go/HTTP REST
- One point of entry for the clients

Terminology:
1) Bucket - set of replicas
2) Replica - one set of data (one DHT)
3) DHT - Distributed Hash Table
4) Hash ring - consistent hashing algorithm
5) Node - a single server in the Elliptics network

Hash Ring
Ring range: 0 to 2048 (for simplification, in reality 2^512)
Node 1: hash ring ranges
Node 2: hash ring ranges
*this and the following slides are a simplification of what’s actually happening

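Start-up and DHT initialization
Routing table columns: IP addr | Hash ring segments
Step 1. Both routing tables are empty:
  Node 1 routing table: Node 1 (no segments), Node 2 (no segments)
  Node 2 routing table: Node 1 (no segments), Node 2 (no segments)
Step 2. Each node fills in its own segments:
  Node 1 routing table: Node 1: 12, 90, 644; Node 2 (no segments)
  Node 2 routing table: Node 2: 44, 129, 1608; Node 1 (no segments)
Step 3. Both routing tables are complete:
  Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
  Node 2 routing table: Node 2: 44, 129, 1608; Node 1: 12, 90, 644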
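The start-up slides end with both nodes holding the same routing table. A toy sketch of that outcome, assuming each node starts out knowing only its own segments and then learns the others'; the exchange() helper is hypothetical and ignores the real network protocol.

```python
def exchange(tables):
    # Merge what every node knows, then give each node a full copy.
    merged = {}
    for table in tables.values():
        merged.update(table)
    return {node: dict(merged) for node in tables}


# Each node initially knows only its own hash ring segments (values from the slides).
tables = {
    "Node 1": {"Node 1": [12, 90, 644]},
    "Node 2": {"Node 2": [44, 129, 1608]},
}
tables = exchange(tables)
```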

Hash Ring (0 to 2048)
Node 1 hash ring ranges start at: 12, 90, 644
Node 2 hash ring ranges start at: 44, 129, 1608
Points on the ring: 12, 44, 90, 129, 644, 1608

Client connection
[Diagram: the client connects and receives the Node 1 routing table. The client, Node 1 and Node 2 hold the same table:]
Node 1: 12, 90, 644
Node 2: 44, 129, 1608

Writing data
Client: elliptics.write("key1", data1)
hash("key1") == 20
[Diagram: the client, Node 1 and Node 2 routing tables, each listing:]
Node 1: 12, 90, 644
Node 2: 44, 129, 1608

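Hash Ring (0 to 2048)
Node 1 hash ring ranges start at: 12, 90, 644
Node 2 hash ring ranges start at: 44, 129, 1608
Points on the ring: 12, 44, 90, 129, 644, 1608
Key hash 20 is marked on the ring, between 12 and 44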
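How a key hash is mapped to a node can be sketched with a sorted list and a binary search. The assumption here, consistent with the slides (hash 20 is written to Node 1), is that a node owns the range from each of its points up to the next point on the ring, and that hashes below the first point wrap around to the last one. build_ring and lookup are illustrative names, not Elliptics functions.

```python
import bisect


def build_ring(routing_table):
    # Flatten {node: [range starts]} into a sorted list of (start, node).
    return sorted(
        (start, node)
        for node, starts in routing_table.items()
        for start in starts
    )


def lookup(ring, key_hash):
    # Owner is the node with the largest range start <= key_hash.
    starts = [start for start, _ in ring]
    i = bisect.bisect_right(starts, key_hash) - 1
    # i == -1 means the hash is below the first point; Python's negative
    # indexing then selects the last range, which is the wrap-around case.
    return ring[i][1]


ring = build_ring({"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]})
owner_of_key1 = lookup(ring, 20)  # hash("key1") == 20 on the slides
```

With the real 2^512 ring the same search applies; only the size of the numbers changes.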

Writing data
Client: elliptics.write("key1", data1) -> node1.write("key1", data1)
hash("key1") == 20
[Diagram: the client, Node 1 and Node 2 routing tables, each listing:]
Node 1: 12, 90, 644
Node 2: 44, 129, 1608

Reading data
Client: elliptics.read("key1") -> node1.read("key1")
hash("key1") == 20
[Diagram: the client, Node 1 and Node 2 routing tables, each listing:]
Node 1: 12, 90, 644
Node 2: 44, 129, 1608

Add new node
Step 1. Node 3 joins with no segments in its routing table (Node 3, Node 2, Node 1). The client, Node 1 and Node 2 tables list:
  Node 1: 12, 90, 644
  Node 2: 44, 129, 1608
Step 2. The Node 3 routing table is filled in; the other tables are unchanged:
  Node 3: 300, 666, 1024
  Node 2: 44, 129, 1608
  Node 1: 12, 90, 644
Step 3. The client, Node 1 and Node 2 add Node 3; all routing tables now list:
  Node 1: 12, 90, 644
  Node 2: 44, 129, 1608
  Node 3: 300, 666, 1024
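A small check of what adding Node 3 changes, using the simplified 0 to 2048 ring from the slides. The owner() helper is hypothetical and assumes a node owns the range from each of its points up to the next point, wrapping around at the ends. The point of the sketch: only the ranges taken over by the new node change owner, everything else stays where it was.

```python
import bisect


def owner(routing_table, key_hash):
    # Sorted (range start, node) pairs; owner has the largest start <= key_hash.
    ring = sorted((s, n) for n, starts in routing_table.items() for s in starts)
    i = bisect.bisect_right([s for s, _ in ring], key_hash) - 1
    return ring[i][1]  # i == -1 wraps around to the last range


before = {"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]}
after = {**before, "Node 3": [300, 666, 1024]}

# Hashes whose owner differs once Node 3 has joined.
moved = [h for h in range(2048) if owner(before, h) != owner(after, h)]
new_owners = {owner(after, h) for h in moved}
```

Here new_owners contains only "Node 3", and a key such as hash 20 keeps its owner, so no data has to move between Node 1 and Node 2.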

Losing a node
[Diagram: Node 1 has failed. The client and Node 2 routing tables no longer list any segments for Node 1:]
Node 1: (no segments)
Node 2: 44, 129, 1608

Hash Ring (0 to 2048) after losing Node 1
Node 1 hash ring ranges: none
Node 2 hash ring ranges start at: 44, 129, 1608
Points on the ring: 44, 129, 1608

Writing data (with failed nodes)
Client: elliptics.write("key1", data1) -> node2.write("key1", data1)
hash("key1") == 20
"key1" -> data
[Diagram: the client and Node 2 routing tables list Node 1 with no segments and Node 2: 44, 129, 1608]

Reading data (with failed nodes)
Client: elliptics.read("key1")
hash("key1") == 20
"key1" -> data
[Diagram: the client and Node 2 routing tables list Node 1 with no segments and Node 2: 44, 129, 1608]
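The failed-node slides can be replayed with the same toy model: once Node 1's points are gone from the routing table, hash 20 falls below the first remaining point (44) and wraps around to the range starting at 1608, which belongs to Node 2. The owner() helper and the dict-based storage are illustrative assumptions, not Elliptics code.

```python
import bisect


def owner(routing_table, key_hash):
    # Sorted (range start, node) pairs; owner has the largest start <= key_hash.
    ring = sorted((s, n) for n, starts in routing_table.items() for s in starts)
    i = bisect.bisect_right([s for s, _ in ring], key_hash) - 1
    return ring[i][1]  # i == -1 wraps around to the last range


full = {"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]}
# Node 1 is down: its segments disappear from the routing table.
degraded = {node: starts for node, starts in full.items() if node != "Node 1"}

storage = {"Node 1": {}, "Node 2": {}}
KEY, KEY_HASH = "key1", 20  # hash value taken from the slides

target = owner(degraded, KEY_HASH)
storage[target][KEY] = b"data1"  # the write lands on the surviving node
```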

Reading data (with restored nodes)
Client: elliptics.read("key1") -> node1.read("key1")
hash("key1") == 20
"key1" -> data
[Diagram: Node 1 is back. The client, Node 1 and Node 2 routing tables again list:]
Node 1: 12, 90, 644
Node 2: 44, 129, 1608

Merge - a special procedure that moves keys and data that do not belong to the local node. Such keys are moved to the nodes they belong to, restoring consistency.
* Merge is FAST
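The idea behind merge can be sketched as a pass over every node's local keys that hands each misplaced record to its rightful owner. This is a simplification under stated assumptions: owner(), merge() and the key_hashes mapping are hypothetical, and the real procedure works on eblob data over the network.

```python
import bisect


def owner(routing_table, key_hash):
    # Sorted (range start, node) pairs; owner has the largest start <= key_hash.
    ring = sorted((s, n) for n, starts in routing_table.items() for s in starts)
    i = bisect.bisect_right([s for s, _ in ring], key_hash) - 1
    return ring[i][1]  # i == -1 wraps around to the last range


def merge(routing_table, storage, key_hashes):
    """Move every record to the node that owns its hash; return the move count."""
    moved = 0
    for node, records in storage.items():
        for key in list(records):  # snapshot, since records is modified below
            rightful = owner(routing_table, key_hashes[key])
            if rightful != node:
                storage[rightful][key] = records.pop(key)
                moved += 1
    return moved


table = {"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]}
# "key1" was written to Node 2 while Node 1 was down.
storage = {"Node 1": {}, "Node 2": {"key1": b"data1"}}
moved = merge(table, storage, {"key1": 20})
```

After the pass, a read routed to Node 1 for hash 20 finds the record again.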

Merge
"key1" -> data
hash("key1") == 20
[Diagram: the client, Node 1 and Node 2 routing tables, each listing:]
Node 1: 12, 90, 644
Node 2: 44, 129, 1608

Elliptics backend — EBLOB
Eblob is an append-only low-level IO library, which saves data in blob files.
Elliptics uses it as one of its low-level IO backends.
Supported features:
- Fast append-only updates which do not require disk seeks
- Compact index to populate lookup information from disk
- Multi-threaded index reading during startup (gives you fast storage start)
- O(1) data location lookup time (for in-memory indexes)
- Ability to lock in-memory lookup index (hash table) to eliminate memory swap
- Readahead games with data and index blobs for maximum performance
- Multiple blob files support (tested with single blob file on block device too)
- Optional sha512 on-disk checksumming
- Direct streaming from eblob to client, there’s an Nginx module for that
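The combination of append-only updates and an in-memory index with O(1) lookup can be illustrated with a toy store. This is not Eblob's on-disk format or API: AppendOnlyBlob is a hypothetical class and an in-memory buffer stands in for the blob file.

```python
import io


class AppendOnlyBlob:
    """Toy append-only store with an in-memory hash index."""

    def __init__(self):
        self.blob = io.BytesIO()  # stand-in for a blob file on disk
        self.index = {}           # key -> (offset, size)

    def write(self, key, data: bytes) -> None:
        # Always append at the end; an update simply repoints the index.
        offset = self.blob.seek(0, io.SEEK_END)
        self.blob.write(data)
        self.index[key] = (offset, len(data))

    def read(self, key) -> bytes:
        offset, size = self.index[key]  # O(1) hash table lookup
        self.blob.seek(offset)
        return self.blob.read(size)

    def remove(self, key) -> None:
        # Only the index entry goes away; the bytes stay in the blob
        # until a defragmentation pass reclaims them.
        del self.index[key]
```

Old versions of a record stay in the blob as garbage, which is why a defragmentation step is needed.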

Elliptics backend — EBLOB
Supported features:
- 2-stage write: prepare (which reserves the space) and commit (which calculates the checksum and updates in-memory and on-disk indexes). One can (re)write data using pwrite() in between without locks
- Usual 1-stage write interface
- Flexible configuration of hash table size, flags, alignment
- Defragmentation tool: entries to be deleted are only marked as removed; eblob_check will iterate over the specified blob files and actually remove those blocks
- Off-line blob consistency checker: eblob_check can verify checksums for all records which have them enabled
- Run-time sync support: a dedicated thread runs fsync in the background on all files on a timed basis
- Sorted data and indexes on disk: ideal for column creation, iteration, subkeys and range requests
- In-memory index compression (up to 60%), ~64 bytes per key in RAM