Elliptics
Building a distributed, fault-tolerant data storage
Rim Zaydullin
25 September 2017
Safe and…
really safe
Safe… or is it unlikely to break?
In the 21st century we figured out ways to get around disk problems:
RAID, replication, Reed-Solomon coding, LDPC and many others
*Enterprise IBM hard drive, circa 1980:
1.7 or 3.4 GB capacity, price 250,000 USD
What if it is “some master server”?
But what will happen when the server goes down?
What if the whole datacenter goes down?
Should you plan for this?
“It will never crash.”
The probability of these events can be VERY small
What will happen to your business/systems if it happens after all?
“Things always become obvious after the fact” 
― Nassim Nicholas Taleb
Reasons for losses of servers, data-centers, coherence
• Tornado, earthquake, flood
• Tech support made a change on the wrong rack
• Errors made by NOCs
• A cat that got into the electrical transformer and burned together with the equipment
• The virtual machine cluster got a new, really angry neighbor
• The cloud provider suddenly went down (say hello, Amazon S3!)
• An excavator tearing up an underground optical cable while digging a ditch
*all the above examples are from real life
You can fix anything… if you have enough time and money.
And if you have nothing else to do :)
Choosing the data storage system that is right for you.
You need to answer the following questions:
What is your record size: bytes? KB? MB? GB? TB?
Do you need:
- transactions?
- replication?
- fastest access possible?
- query language?
- full-text search?
- CAP properties?
- scalability options?
…
To put it simply:
- Massively scalable - replica sets of DHTs
- Fault tolerant by design
- Fast - async I/O, caching, Eblob, bloom filters
- Ease of use: C, C++, Go, HTTP REST, WebDAV, (S3)
- One point of entry for the clients
Elliptics:
- a very fast, linearly scalable NoSQL (key/value) data storage
- based on DHT principles
- designed to store medium to large data records, from > 1 KB up to terabytes
Features:
- No transaction support, but a write to one replica is atomic
- CAP - Availability, Partition tolerance + Eventual consistency
- No metadata servers, true horizontal scaling
- Replication - geographically distributed replication
- Direct P2P data streaming (useful for large files)
- Access speed - true O(1) data read access + SLRU cache
- Automatic data repartitioning in case of removed or added storage nodes
- Bulk writes
- Datacenter aware (cross datacenter replication) and CDN
- and much more…
Open source (GPL), implemented in C/C++
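The "SLRU cache" item can be illustrated with a minimal segmented-LRU sketch. This is a generic textbook SLRU, not Elliptics code: the class name, the two segment sizes and the promotion rule (a hit in the probationary segment moves the entry to the protected segment) are assumptions made for illustration.

```python
from collections import OrderedDict


class SLRUCache:
    """Segmented LRU: new keys enter a probationary segment; a later hit
    promotes them to a protected segment that one-off reads cannot flush."""

    def __init__(self, probation_size, protected_size):
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation = OrderedDict()  # oldest entry first
        self.protected = OrderedDict()  # oldest entry first

    def _trim_probation(self):
        # Evict the least recently used probationary entries.
        while len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the coldest protected entry instead of dropping it.
            old_key, old_value = self.protected.popitem(last=False)
            self.probation[old_key] = old_value
            self._trim_probation()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
            return
        self.probation[key] = value
        self.probation.move_to_end(key)
        self._trim_probation()


cache = SLRUCache(probation_size=2, protected_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # second touch: "a" is promoted to the protected segment
cache.put("c", 3)
cache.put("d", 4)  # probation overflows, "b" is evicted; "a" survives
```

The point of the two segments is that a burst of keys read only once can push out other once-read keys, but not the records that have already proven to be hot.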
CAP theorem
Consistency: all clients see the same data at the same time
Availability: the system will always respond to a request, even if data is not completely consistent
Partition tolerance: the system works even in the presence of node/network failures
CA: RDBMS (MySQL/MariaDB, Postgres, MSSQL, Oracle), Elastic Search, …
CP: HBase, MongoDB, Redis, Google Bigtable, Ceph, …
PA: Elliptics, Cassandra, Riak, DynamoDB, CouchDB, …
THE GREAT MISCONCEPTION
about
EVENTUAL CONSISTENCY
The ultimate performance guide:
* parallel write
* async I/O
* SLRU
* cache everything
* 2 million RPS from 10 nodes
* O(1) lookup time
* redirect and CDN
* data streaming
* HTTP(S) access
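A minimal sketch of "parallel write" combined with "async I/O", assuming it means that one write is sent to every replica of a bucket concurrently. FakeReplica, write_to_bucket and the delays are hypothetical stand-ins; this is not the Elliptics client API.

```python
import asyncio


class FakeReplica:
    """Hypothetical in-memory replica used only for this illustration."""

    def __init__(self, name, delay, fail=False):
        self.name = name
        self.delay = delay
        self.fail = fail
        self.store = {}

    async def write(self, key, data):
        await asyncio.sleep(self.delay)  # simulated network and disk latency
        if self.fail:
            raise ConnectionError(self.name + " is unreachable")
        self.store[key] = data
        return self.name


async def write_to_bucket(replicas, key, data):
    # Start all replica writes at once; total latency is roughly the slowest
    # replica, not the sum of all of them.
    results = await asyncio.gather(
        *(replica.write(key, data) for replica in replicas),
        return_exceptions=True,
    )
    # Report which replicas acknowledged the write.
    return [r for r in results if not isinstance(r, Exception)]


replicas = [
    FakeReplica("replica-1", 0.01),
    FakeReplica("replica-2", 0.02, fail=True),
    FakeReplica("replica-3", 0.01),
]
written = asyncio.run(write_to_bucket(replicas, "key1", b"data1"))
print(written)
```

One failed replica does not block the others; the caller sees which replicas hold the record and can decide whether that is enough.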
Architecture diagram (top to bottom):
Clients: Backrunner - HTTP(S), Go / C++ / C / WebDAV client
Elliptics storage:
  Buckets: three shown; each bucket is N replicas
  Replicas: three per bucket shown; each replica is a DHT (Distributed Hash Table)
  eblob: three per replica shown
Storage hardware
Elliptics: Fast, Open… and safe.
In production at:
Yandex.Disk
Yandex.Mail
Yandex.Maps
Yandex.Music
Yandex.Photos
http://elliptics.io
Q&A
Rim Zaydullin
Thank you!
zaydullinr@seagroup.com
Additional technical slides
To put it simply:
- Scalable - DHT
- Fault tolerant by design
- Fast - Eblob, async I/O, caching, bloom filters
- Simple to use: C/C++/Go/HTTP REST
- One point of entry for the clients
Terminology:
1) Bucket - set of replicas
2) Replica - one set of data (one DHT)
3) DHT - Distributed Hash Table
4) Hash ring - consistent hashing algorithm
5) Node - one of the servers in an Elliptics network
Hash Ring
The ring runs from 0 to 2048 for simplification; in reality it has 2^512 positions.
Node 1 and Node 2 each own a set of hash ring ranges.
*This and the following slides are a simplification of what's actually happening.
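A minimal sketch of how a key could be mapped to a ring position. It assumes SHA-512 as the hash function, which yields 512-bit values and therefore matches the 2^512 ring size; the slides do not name the function. The function names are hypothetical, and the modulo-2048 variant only mirrors the simplified ring used on these slides.

```python
import hashlib

RING_BITS = 512              # full ring: 2^512 positions
SIMPLIFIED_RING_SIZE = 2048  # ring size used on the slides


def ring_position(key):
    """Map a key to an integer position on the full 2^512 ring."""
    digest = hashlib.sha512(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def simplified_position(key):
    """Fold the full position onto the 0..2047 ring used for illustration."""
    return ring_position(key) % SIMPLIFIED_RING_SIZE


print(simplified_position("key1"))
```

The value printed here will not be the 20 used for "key1" on the slides; that number is an illustrative placeholder.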
Start-up and DHT initialization

1) Both nodes start with routing tables that list Node 1 and Node 2 but contain no hash ring segments.

2) Each node fills in its own segments.

Node 1 routing table
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2

Node 2 routing table
IP addr   Hash ring segments
Node 2    44, 129, 1608
Node 1

3) Both routing tables contain the segments of both nodes.

Node 1 routing table
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608

Node 2 routing table
IP addr   Hash ring segments
Node 2    44, 129, 1608
Node 1    12, 90, 644
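The start-up steps can be sketched as two nodes that each know only their own segments and then merge routing tables. The Node class and its join method are hypothetical; how Elliptics nodes actually exchange routes is not shown on the slides.

```python
class Node:
    """Hypothetical node that starts out knowing only its own ring segments."""

    def __init__(self, name, segments):
        self.name = name
        self.routing_table = {name: sorted(segments)}

    def join(self, other):
        # Merge both tables so each side ends up knowing every segment.
        merged = {**self.routing_table, **other.routing_table}
        self.routing_table = dict(merged)
        other.routing_table = dict(merged)


node1 = Node("Node 1", [12, 90, 644])
node2 = Node("Node 2", [44, 129, 1608])
node1.join(node2)
print(node1.routing_table)
```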
Hash Ring (0 to 2048)
Ring points in order: 12 (Node 1), 44 (Node 2), 90 (Node 1), 129 (Node 2), 644 (Node 1), 1608 (Node 2)

Client connection
The client, Node 1 and Node 2 all hold the same routing table:
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608
Writing data
elliptics.write("key1", data1)
hash("key1") == 20
On the hash ring (0 to 2048), 20 lies between the points 12 (Node 1) and 44 (Node 2).
The client sends the write to Node 1: node1.write("key1", data1)

Reading data
elliptics.read("key1")
hash("key1") == 20
The client sends the read to Node 1: node1.read("key1")

The routing tables (client, Node 1, Node 2) are the same for both operations:
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608
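The routing decision for reads and writes can be sketched with a sorted list of ring points and a binary search. The ownership rule used here (a key belongs to the nearest ring point at or below its hash, wrapping around to the last point) is an assumption chosen because it reproduces the slides' example, where hash 20 is routed to Node 1. The function names are hypothetical.

```python
import bisect

ROUTING_TABLE = {
    "Node 1": [12, 90, 644],
    "Node 2": [44, 129, 1608],
}


def build_ring(routing_table):
    """Flatten the routing table into a sorted list of (point, node) pairs."""
    return sorted(
        (point, node)
        for node, points in routing_table.items()
        for point in points
    )


def owner(ring, key_hash):
    """Return the node whose ring point is the closest one at or below key_hash."""
    points = [point for point, _ in ring]
    idx = bisect.bisect_right(points, key_hash) - 1
    # idx == -1 means key_hash is below the first point: wrap to the last one.
    return ring[idx][1]


ring = build_ring(ROUTING_TABLE)
print(owner(ring, 20))
```

Because every client holds the full routing table, the lookup is local and needs no metadata server.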
Scaling

Add new node

1) Node 3 starts with a routing table that lists Node 3, Node 2 and Node 1 without hash ring segments. The client, Node 1 and Node 2 routing tables still contain only Node 1 (12, 90, 644) and Node 2 (44, 129, 1608).

2) The Node 3 routing table is filled in; the other tables are unchanged.

Node 3 routing table
IP addr   Hash ring segments
Node 3    300, 666, 1024
Node 2    44, 129, 1608
Node 1    12, 90, 644

3) The client, Node 1 and Node 2 routing tables now include Node 3 as well:
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608
Node 3    300, 666, 1024
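A minimal sketch of what adding Node 3 does to key ownership on the simplified 0..2047 ring. It assumes the same ownership rule as the slides' example suggests (nearest ring point at or below the hash, with wrap-around); the owner helper is hypothetical.

```python
import bisect


def owner(routing_table, key_hash):
    """Node whose ring point is the closest one at or below key_hash (wraps)."""
    ring = sorted((p, n) for n, pts in routing_table.items() for p in pts)
    idx = bisect.bisect_right([p for p, _ in ring], key_hash) - 1
    return ring[idx][1]


before = {"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]}
after = {**before, "Node 3": [300, 666, 1024]}

# Ring positions whose owner changes once Node 3 joins.
moved = [h for h in range(2048) if owner(before, h) != owner(after, h)]
new_owners = {owner(after, h) for h in moved}
print(len(moved), new_owners)
```

Under this rule, every position that changes owner moves to the new node; nothing is shuffled between Node 1 and Node 2.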
Losing a node
Node 1 goes down. Its own routing table is unchanged, but Node 2 and the client no longer have segments for it.

Node 1 routing table
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608

Node 2 routing table
IP addr   Hash ring segments
Node 2    44, 129, 1608
Node 1

Client routing table
IP addr   Hash ring segments
Node 1
Node 2    44, 129, 1608

Hash Ring (0 to 2048)
Remaining ring points: 44, 129, 1608 (all Node 2)
Writing data (with failed nodes)
elliptics.write("key1", data1)
hash("key1") == 20
Node 1 has no segments in the client routing table, so the write goes to Node 2: node2.write("key1", data1)
"key1" -> data

Reading data (with failed nodes)
elliptics.read("key1")
hash("key1") == 20
"key1" -> data

Routing tables are unchanged: Node 2 and the client list Node 1 without segments.
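A minimal sketch of routing while Node 1 is missing from the client's routing table. The ownership rule (nearest ring point at or below the hash, with wrap-around) is an assumption that matches the slides' example, and the in-memory stores dictionary and the write and read helpers are hypothetical stand-ins for real nodes.

```python
import bisect


def owner(routing_table, key_hash):
    """Node whose ring point is the closest one at or below key_hash (wraps)."""
    ring = sorted((p, n) for n, pts in routing_table.items() for p in pts)
    idx = bisect.bisect_right([p for p, _ in ring], key_hash) - 1
    return ring[idx][1]


full_table = {"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]}
# Client view while Node 1 is down: its segments are gone.
client_table = {n: pts for n, pts in full_table.items() if n != "Node 1"}

stores = {"Node 1": {}, "Node 2": {}}  # hypothetical per-node storage


def write(table, key, key_hash, data):
    node = owner(table, key_hash)
    stores[node][key] = data
    return node


def read(table, key, key_hash):
    return stores[owner(table, key_hash)].get(key)


wrote_to = write(client_table, "key1", 20, b"data1")
during_failure = read(client_table, "key1", 20)
after_restore = read(full_table, "key1", 20)
print(wrote_to, during_failure, after_restore)
```

In this sketch the record lands on Node 2 and stays readable during the failure. Once Node 1 is back in the table, hash 20 routes to Node 1 again, which does not have the record yet.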
Reading data (with restored nodes)
elliptics.read("key1")
hash("key1") == 20
Node 1 is back in all routing tables, so the read is sent to Node 1: node1.read("key1")
"key1" -> data

Routing tables (client, Node 1, Node 2):
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608
Merge - a special procedure that moves keys and data that do not belong to the local node. Such keys are moved to the nodes they belong to, restoring consistency.
* Merge is FAST
Merge
hash("key1") == 20
"key1" -> data

Routing tables (client, Node 1, Node 2):
IP addr   Hash ring segments
Node 1    12, 90, 644
Node 2    44, 129, 1608
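The merge procedure described above can be sketched as a pass over the local node's keys that hands over every record owned by another node. The ownership rule, the merge function, the in-memory stores and the extra key "key2" (hash 50) are assumptions for illustration; the real procedure is not shown on the slides.

```python
import bisect


def owner(routing_table, key_hash):
    """Node whose ring point is the closest one at or below key_hash (wraps)."""
    ring = sorted((p, n) for n, pts in routing_table.items() for p in pts)
    idx = bisect.bisect_right([p for p, _ in ring], key_hash) - 1
    return ring[idx][1]


def merge(local_name, stores, routing_table, key_hashes):
    """Move every record held by local_name to the node that owns its hash."""
    moved = []
    for key in list(stores[local_name]):
        target = owner(routing_table, key_hashes[key])
        if target != local_name:
            stores[target][key] = stores[local_name].pop(key)
            moved.append((key, target))
    return moved


routing_table = {"Node 1": [12, 90, 644], "Node 2": [44, 129, 1608]}
key_hashes = {"key1": 20, "key2": 50}  # "key2" is a made-up second record
stores = {
    "Node 1": {},
    "Node 2": {"key1": b"data1", "key2": b"data2"},
}
moved = merge("Node 2", stores, routing_table, key_hashes)
print(moved)
```

Only misplaced records travel: "key2" hashes into Node 2's own range and stays where it is. The sketch ignores the case where the target already holds a newer version of the key.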
Elliptics backend — EBLOB
Eblob is an append-only low-level IO library, which saves data in blob files.
Elliptics uses it as one of its low-level IO backends.
Supported features:
- Fast append-only updates which do not require disk seeks
- Compact index to populate lookup information from disk
- Multi-threaded index reading during startup (gives you fast storage start)
- O(1) data location lookup time (for in-memory indexes)
- Ability to lock in-memory lookup index (hash table) to eliminate memory swap
- Readahead games with data and index blobs for maximum performance
- Multiple blob files support (tested with single blob file on block device too)
- Optional sha512 on-disk checksumming
- Direct streaming from eblob to client, there’s an Nginx module for that
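A minimal sketch of the append-only idea with an in-memory lookup index. It is not Eblob's API or on-disk format: the class name, the (offset, size) index layout and the use of an in-memory buffer in place of a blob file are all assumptions made for illustration.

```python
import io


class AppendOnlyBlob:
    """Records are only ever appended; a dict maps key -> (offset, size)."""

    def __init__(self, blob=None):
        # A BytesIO stands in for a blob file opened in binary mode.
        self.blob = blob if blob is not None else io.BytesIO()
        self.index = {}

    def write(self, key, data):
        # Always write at the end, so an update never seeks into old data.
        self.blob.seek(0, io.SEEK_END)
        offset = self.blob.tell()
        self.blob.write(data)
        self.index[key] = (offset, len(data))

    def read(self, key):
        # One dictionary lookup, then one positioned read.
        offset, size = self.index[key]
        self.blob.seek(offset)
        return self.blob.read(size)

    def remove(self, key):
        # Only the index entry goes away; the bytes stay until defragmentation.
        self.index.pop(key, None)


blob = AppendOnlyBlob()
blob.write("key1", b"hello")
blob.write("key2", b"world")
blob.write("key1", b"hello again")  # the old version becomes dead space
print(blob.read("key1"))
```

The dictionary lookup is what gives constant-time data location, and the dead space left by overwrites and removals is why a defragmentation step is needed.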
Elliptics backend — EBLOB
Supported features:
- 2-stage write: prepare (which reserves the space) and commit (which calculates the checksum and updates the in-memory and on-disk indexes). One can (re)write data using pwrite() in between without locks
- Usual 1-stage write interface
- Flexible configuration of hash table size, flags, alignment
- Defragmentation tool: entries to be deleted are only marked as removed; eblob_check will iterate over the specified blob files and actually remove those blocks
- Off-line blob consistency checker: eblob_check can verify checksums for all records which have them enabled
- Run-time sync support: a dedicated thread runs fsync in the background on all files on a timed basis
- Sorted data and indexes on disk: ideal for column creation, iteration, subkeys and range requests
- In-memory index compression (up to 60%), ~64 bytes per key in RAM
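The 2-stage write can be sketched as prepare, positioned writes, then commit. This is an illustration of the idea only, not Eblob's interface: the class, the method names (including the pwrite method, which stands in for the POSIX pwrite() call mentioned above) and the in-memory bytearray are hypothetical.

```python
import hashlib


class TwoStageBlob:
    """prepare() reserves space, pwrite() fills it, commit() publishes it."""

    def __init__(self):
        self.data = bytearray()  # stands in for the blob file
        self.pending = {}        # key -> (offset, size), not yet readable
        self.index = {}          # key -> (offset, size, checksum)

    def prepare(self, key, size):
        # Stage 1: reserve a zero-filled region at the end of the blob.
        offset = len(self.data)
        self.data.extend(bytes(size))
        self.pending[key] = (offset, size)
        return offset

    def pwrite(self, key, chunk, at):
        # Fill any part of the reserved region, in any order.
        offset, size = self.pending[key]
        if at < 0 or at + len(chunk) > size:
            raise ValueError("write outside the reserved region")
        self.data[offset + at:offset + at + len(chunk)] = chunk

    def commit(self, key):
        # Stage 2: checksum the region and make the record visible.
        offset, size = self.pending.pop(key)
        payload = bytes(self.data[offset:offset + size])
        self.index[key] = (offset, size, hashlib.sha512(payload).hexdigest())

    def read(self, key):
        offset, size, checksum = self.index[key]
        payload = bytes(self.data[offset:offset + size])
        if hashlib.sha512(payload).hexdigest() != checksum:
            raise IOError("checksum mismatch")
        return payload


blob = TwoStageBlob()
blob.prepare("key1", 10)
blob.pwrite("key1", b"world", at=5)  # chunks may arrive out of order
blob.pwrite("key1", b"hello", at=0)
visible_before_commit = "key1" in blob.index
blob.commit("key1")
print(blob.read("key1"))
```

Because each writer fills only its own reserved region and the index changes only at commit, a large record can be uploaded in pieces without readers ever seeing a partial record.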
Data Control Language.pptx Data Control Language.pptx
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 

Elliptics

  • 1. Elliptics: building a distributed, fault-tolerant data storage. Rim Zaydullin, 25 September 2017
  • 3. Safe… or is it unlikely to break?
  • 4. In the 21st century we figured out ways to get around disk problems: RAID, replication, Reed-Solomon coding, LDPC and many others. *Enterprise IBM hard drive, circa 1980: 1.7 or 3.4 GB capacity, price: 250,000 USD
  • 5. What if it is “some master server”? But what will happen when the server goes down?
  • 6. What if the whole datacenter goes down? Should you plan for this?
  • 7. “It will never crash.”
  • 8. The probability of these events can be VERY small. What happens to your business/systems if one occurs after all? “Things always become obvious after the fact” (Nassim Nicholas Taleb)
  • 9. Reasons for losses of servers, data centers, coherence:
      • Tornado, earthquake, flood
      • Tech support made a change on the wrong rack
      • Errors made by NOCs
      • A cat that got into the electrical transformer and burned together with the equipment
      • A virtual machine cluster got a new, really angry neighbor
      • Cloud provider suddenly went down (say hello, Amazon S3!)
      • An excavator tearing an underground optical cable while digging a ditch
      *All of the above examples are from real life
  • 10. You can fix anything… if you have enough time and money. And if you have nothing else to do :)
  • 11. Choosing the data storage system that is right for you. You need to answer the following questions:
      What is your record size: bytes? KB? MB? GB? TB?
      Do you need:
      - transactions?
      - replication?
      - fastest access possible?
      - query language?
      - full-text search?
      - CAP properties?
      - scalability options?
      …
  • 12. To put it simply:
      - Massively scalable: replica sets of DHTs
      - Fault tolerant by design
      - Fast: async I/O, caching, Eblob, bloom filters
      - Ease of use: C, C++, Go, HTTP REST, WebDAV, (S3)
      - One point of entry for the clients
  • 13. Elliptics:
      - a very fast, linearly scalable NoSQL (key/value) data storage
      - based on DHT principles
      - designed to store medium to large data records, > 1 KB and up to terabytes
      Features:
      - No transaction support, but a write to one replica is atomic
      - CAP: Availability, Partition tolerance + eventual consistency
      - No metadata servers, true horizontal scaling
      - Replication: geographically distributed replication
      - Direct P2P data streaming (useful for large files)
      - Access speed: true O(1) data read access + SLRU cache
      - Automatic data repartitioning when storage nodes are removed or added
      - Bulk writes
      - Datacenter aware (cross-datacenter replication) and CDN
      - and much more…
      Open source (GPL), implemented in C/C++
  • 14. CAP theorem
      Consistency: all clients see the same data at the same time
      Availability: will always respond to a request, even if data is not completely consistent
      Partition tolerance: works even in the presence of node/network failures
      CA: RDBMS: MySQL/MariaDB, Postgres, MSSQL, Oracle; Elastic Search; …
      CP: HBase, MongoDB, Redis, Google Bigtable, Ceph, …
      PA: Elliptics, Cassandra, Riak, DynamoDB, CouchDB, …
  • 16. The ultimate performance guide: * parallel writes * async I/O
  • 17. The ultimate performance guide: * SLRU * cache everything * 2 million RPS from 10 nodes
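The SLRU (segmented LRU) cache named on this slide can be illustrated with a minimal Python sketch. This is not Elliptics' implementation: the class name `SLRUCache`, the segment sizes and the promotion policy are assumptions made for illustration. New keys enter a probationary segment, and a second access promotes them to a protected segment, so one-off reads cannot push out frequently used entries.

```python
from collections import OrderedDict


class SLRUCache:
    """Hypothetical two-segment LRU cache (probation + protected)."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()   # keys seen once
        self.protected = OrderedDict()   # keys seen at least twice
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)      # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)            # second access: promote
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation.pop(key)
            self._promote(key, value)
        else:
            self._insert_probation(key, value)

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the least recently used protected entry to probation.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        self.probation.move_to_end(key)
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)   # evict from the cache


cache = SLRUCache(probation_size=2, protected_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is promoted to the protected segment
cache.put("c", 3)
cache.put("d", 4)     # probation overflows and "b" is evicted
```

In this sketch a burst of new keys only churns the probationary segment; "a" survives because it was read twice.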
  • 18. *O(1) lookup time *Redirect and CDN *Data Streaming *HTTP(S) access
  • 19. [Architecture diagram] Layers, in order: Backrunner - HTTP(S), Go / C / C++ / WebDAV client; Buckets; Replicas; eblob backends; Storage hardware
  • 20. Elliptics storage. Clients: Backrunner - HTTP(S), Go/C++/C/WebDAV client. Buckets:
  • 23. Elliptics: Fast, Open… and safe. In production at: Yandex.Disk, Yandex.Mail, Yandex.Maps, Yandex.Music, Yandex.Photos. http://elliptics.io
  • 26. To put it simply:
      - Scalable: DHT
      - Fault tolerant by design
      - Fast: Eblob, async I/O, caching, bloom filters
      - Simplicity of usage: C/C++/Go/HTTP REST
      - One point of entry for the clients
  • 27. Terminology:
      1) Bucket - a set of replicas
      2) Replica - one set of data (one DHT)
      3) DHT - Distributed Hash Table
      4) Hash ring - consistent hashing algorithm
      5) Node - one of the nodes in the Elliptics network
  • 28. Hash Ring: 0 to 2048 for simplification, in reality 2^512. Legend: Node 1 hash ring ranges, Node 2 hash ring ranges. *This and the following slides are a simplification of what’s actually happening
  • 29. Start-up and DHT initialization
      Node 1 routing table (IP addr: hash ring segments): Node 1: (empty); Node 2: (empty)
      Node 2 routing table (IP addr: hash ring segments): Node 1: (empty); Node 2: (empty)
  • 30. Start-up and DHT initialization
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: (empty)
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: (empty)
  • 31. Start-up and DHT initialization
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: 12, 90, 644
  • 32. Hash Ring (0 to 2048). Points on the ring: 12, 44, 90, 129, 644, 1608. Legend: Node 1 hash ring ranges, Node 2 hash ring ranges
  • 33. Client connection
      The client, Node 1 and Node 2 each hold the same routing table (IP addr: hash ring segments):
      Node 1: 12, 90, 644
      Node 2: 44, 129, 1608
  • 34. Writing data
      Client: elliptics.write("key1", data1); hash("key1") == 20
      Routing tables (client, Node 1, Node 2): Node 1: 12, 90, 644; Node 2: 44, 129, 1608
  • 35. Hash Ring (0 to 2048). Points on the ring: 12, 44, 90, 129, 644, 1608, with the key hash 20 marked
  • 36. Writing data
      Client: elliptics.write("key1", data1); hash("key1") == 20; the request becomes node1.write("key1", data1)
      Routing tables (client, Node 1, Node 2): Node 1: 12, 90, 644; Node 2: 44, 129, 1608
  • 37. Reading data
      Client: elliptics.read("key1"); hash("key1") == 20; the request becomes node1.read("key1")
      Routing tables (client, Node 1, Node 2): Node 1: 12, 90, 644; Node 2: 44, 129, 1608
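The routing decision on the write and read slides above can be sketched in a few lines of Python. This is a simplification under stated assumptions, not the Elliptics client code: `build_ring` and `owner` are hypothetical names, the ring is the 2048-slot toy ring from the slides, and the rule "a range starts at a node's point and runs to the next point" is inferred from the slides sending hash 20 to Node 1, whose point 12 precedes it.

```python
import bisect

# Ring points per node, copied from the routing tables on the slides.
ROUTES = {
    "node1": [12, 90, 644],
    "node2": [44, 129, 1608],
}
RING_SIZE = 2048  # simplified ring; the slides say the real one is 2^512


def build_ring(routes):
    """Flatten a routing table into a sorted list of (point, node) pairs."""
    return sorted((point, node) for node, points in routes.items() for point in points)


def owner(ring, key_hash):
    """Return the node whose range contains key_hash.

    Assumption: the owner is the node with the nearest ring point at or
    below the hash; hashes below the first point wrap to the last point.
    """
    points = [point for point, _ in ring]
    idx = bisect.bisect_right(points, key_hash % RING_SIZE) - 1
    return ring[idx][1]  # idx == -1 selects the last point (wrap-around)


ring = build_ring(ROUTES)
target = owner(ring, 20)  # hash("key1") == 20 on the slides
```

Because every client and node holds the same table, each of them can compute the target locally, with no metadata server involved.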
  • 39. Add new node
      Node 3 routing table: Node 3: (empty); Node 2: (empty); Node 1: (empty)
      Client, Node 1 and Node 2 routing tables: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
  • 40. Add new node
      Node 3 routing table: Node 3: 300, 666, 1024; Node 2: 44, 129, 1608; Node 1: 12, 90, 644
      Client, Node 1 and Node 2 routing tables: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
  • 41. Add new node
      All routing tables (client, Node 1, Node 2, Node 3): Node 1: 12, 90, 644; Node 2: 44, 129, 1608; Node 3: 300, 666, 1024
  • 42. To put it simply:
      - Scalable: DHT
      - Fault tolerant by design
      - Fast: Eblob, async I/O, caching, bloom filters
      - Simplicity of usage: C/C++/Go/HTTP REST
      - One point of entry for the clients
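A rough way to see what Node 3 joining means for data placement is to compare the owner of every hash before and after the join. The sketch below uses the toy 2048-slot ring and the points from the slides; the `owner` helper and its "nearest point at or below the hash" rule are assumptions for illustration, not Elliptics internals.

```python
import bisect

RING_SIZE = 2048  # simplified ring from the slides


def owner(routes, key_hash):
    """Node owning key_hash: nearest ring point at or below it (assumed rule)."""
    ring = sorted((p, node) for node, points in routes.items() for p in points)
    points = [p for p, _ in ring]
    return ring[bisect.bisect_right(points, key_hash % RING_SIZE) - 1][1]


before = {"node1": [12, 90, 644], "node2": [44, 129, 1608]}
after = dict(before, node3=[300, 666, 1024])

# Collect every hash whose owner changes when Node 3 joins.
moved = {}
for h in range(RING_SIZE):
    old, new = owner(before, h), owner(after, h)
    if old != new:
        moved[h] = (old, new)

new_owners = {new for _, new in moved.values()}
print(len(moved), new_owners)
```

Under this model only hashes that fall into Node 3's new ranges change owner; every other key stays where it was, which is the usual appeal of consistent hashing.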
  • 43. Losing a node
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: (empty)
      Client routing table: Node 1: (empty); Node 2: 44, 129, 1608
  • 44. Hash Ring (0 to 2048). Points on the ring: 44, 129, 1608. Legend: Node 1 hash ring ranges, Node 2 hash ring ranges
  • 45. Writing data (with failed nodes)
      Client: elliptics.write("key1", data1); hash("key1") == 20; the request becomes node2.write("key1", data1)
      "key1" -> data
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: (empty)
      Client routing table: Node 1: (empty); Node 2: 44, 129, 1608
  • 46. Reading data (with failed nodes)
      Client: elliptics.read("key1"); hash("key1") == 20
      "key1" -> data
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: (empty)
      Client routing table: Node 1: (empty); Node 2: 44, 129, 1608
  • 47. Reading data (with restored nodes)
      Client: elliptics.read("key1"); hash("key1") == 20; the request becomes node1.read("key1")
      "key1" -> data
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: 12, 90, 644
      Client routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
  • 48. Merge: a special procedure that moves keys and data that do not belong to the local node. Such keys are moved to the nodes they belong to, restoring consistency. *Merge is FAST
  • 49. Merge
      "key1" -> data; hash("key1") == 20
      Node 1 routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
      Node 2 routing table: Node 2: 44, 129, 1608; Node 1: 12, 90, 644
      Client routing table: Node 1: 12, 90, 644; Node 2: 44, 129, 1608
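The merge step can be sketched as a loop over a node's local keys that hands off anything the routing table assigns elsewhere. The code below is an illustrative model only: `merge`, `owner` and the in-memory `stores` dictionary (keyed by hash rather than by key name) are hypothetical, and the real procedure in Elliptics is certainly more involved.

```python
import bisect

RING_SIZE = 2048  # simplified ring from the slides
ROUTES = {"node1": [12, 90, 644], "node2": [44, 129, 1608]}


def owner(routes, key_hash):
    """Node owning key_hash: nearest ring point at or below it (assumed rule)."""
    ring = sorted((p, node) for node, points in routes.items() for p in points)
    points = [p for p, _ in ring]
    return ring[bisect.bisect_right(points, key_hash % RING_SIZE) - 1][1]


def merge(local, stores, routes):
    """Move every record held by `local` that belongs to another node."""
    moved = []
    for key_hash in list(stores[local]):       # copy keys: we mutate the dict
        target = owner(routes, key_hash)
        if target != local:
            stores[target][key_hash] = stores[local].pop(key_hash)
            moved.append((key_hash, target))
    return moved


# State after Node 1 came back: hash 20 ("key1") was written to Node 2
# while Node 1 was down; hash 50 is a made-up key that belongs to Node 2.
stores = {"node1": {}, "node2": {20: b"data1", 50: b"data2"}}
moved = merge("node2", stores, ROUTES)
```

After the call, hash 20 lives on Node 1 again, so reads routed by the restored table find the data.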
  • 51. Elliptics backend: EBLOB
      Eblob is an append-only low-level IO library, which saves data in blob files. Elliptics uses it as one of its low-level IO backends.
      Supported features:
      - Fast append-only updates which do not require disk seeks
      - Compact index to populate lookup information from disk
      - Multi-threaded index reading during startup (gives you fast storage start)
      - O(1) data location lookup time (for in-memory indexes)
      - Ability to lock the in-memory lookup index (hash table) to eliminate memory swap
      - Readahead games with data and index blobs for maximum performance
      - Multiple blob files support (tested with a single blob file on a block device too)
      - Optional sha512 on-disk checksumming
      - Direct streaming from eblob to the client; there’s an Nginx module for that
  • 52. Elliptics backend: EBLOB
      Supported features:
      - 2-stage write: prepare (which reserves the space) and commit (which calculates the checksum and updates the in-memory and on-disk indexes). One can (re)write data using pwrite() in between without locks
      - Usual 1-stage write interface
      - Flexible configuration of hash table size, flags, alignment
      - Defragmentation tool: entries to be deleted are only marked as removed; eblob_check will iterate over specified blob files and actually remove those blocks
      - Off-line blob consistency checker: eblob_check can verify checksums for all records which have them enabled
      - Run-time sync support: a dedicated thread runs fsync in the background on all files on a timed basis
      - Sorted data and indexes on disk: ideal for column creation, iteration, subkeys and range requests
      - In-memory index compression (up to 60%), ~64 bytes per key in RAM
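The append-only layout and the prepare/commit write described on the two Eblob slides can be modelled with a toy in-memory store. This is a sketch under loud assumptions: `AppendOnlyBlob` and its methods are hypothetical names, a `bytearray` stands in for a blob file, and nothing here reflects eblob's real API or on-disk format.

```python
import hashlib


class AppendOnlyBlob:
    """Toy model: one growing byte buffer plus an in-memory hash index."""

    def __init__(self):
        self.data = bytearray()   # stands in for a blob file on disk
        self.index = {}           # key -> (offset, size, sha512 hex digest)
        self.pending = {}         # key -> (offset, size), reserved, not committed

    def prepare(self, key, size):
        """Stage 1: reserve `size` bytes at the end of the blob."""
        offset = len(self.data)
        self.data.extend(b"\0" * size)
        self.pending[key] = (offset, size)
        return offset

    def pwrite(self, key, chunk, at=0):
        """Fill part of a reserved region; may be repeated before commit."""
        offset, size = self.pending[key]
        if at + len(chunk) > size:
            raise ValueError("write exceeds reserved space")
        self.data[offset + at:offset + at + len(chunk)] = chunk

    def commit(self, key):
        """Stage 2: checksum the region and publish it in the index."""
        offset, size = self.pending.pop(key)
        digest = hashlib.sha512(self.data[offset:offset + size]).hexdigest()
        self.index[key] = (offset, size, digest)

    def write(self, key, value):
        """1-stage write expressed through the two stages."""
        self.prepare(key, len(value))
        self.pwrite(key, value)
        self.commit(key)

    def read(self, key):
        """O(1) index lookup, then one contiguous read with checksum check."""
        offset, size, digest = self.index[key]
        payload = bytes(self.data[offset:offset + size])
        if hashlib.sha512(payload).hexdigest() != digest:
            raise IOError("checksum mismatch")
        return payload


blob = AppendOnlyBlob()
blob.write("k", b"hello")
blob.write("k", b"world!")        # an update appends; old bytes stay in place
blob.prepare("big", 4)
blob.pwrite("big", b"ab")
blob.pwrite("big", b"cd", at=2)   # second chunk lands in the reserved region
visible_before_commit = "big" in blob.index
blob.commit("big")
```

In this model an overwritten record leaves dead bytes behind in the buffer, which is the kind of space a defragmentation pass would reclaim.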