These slides present three key-value stores that use a log structure: Riak, RethinkDB, and LevelDB. Note: my statement that RethinkDB employs an append-only B-tree is an estimate, made by combining guesswork with reasoning.
Log Structure
• A log-structured file system is a file system design first proposed in 1988 by John K. Ousterhout and Fred Douglis.
• It is designed for high write throughput: all updates to data and metadata are written sequentially to a continuous stream, called a log.
• Conventional file systems, by contrast, lay out files with great care for spatial locality and make in-place changes to their data structures.
Log Structure for flash memory
• Random writes degrade system performance and shorten the lifetime of flash memory.
• A log structure is natively flash-friendly!
[Figure: updating data blocks on Magnetic Disk vs. Flash Memory; blocks are labeled data, new data, erased, and free, with a RAM block on the flash side]
Riak?
• Riak is an open source, highly scalable, fault-tolerant distributed database.
• Supported core features:
  - operates in highly distributed environments
  - no single point of failure
  - highly fault-tolerant
  - scales simply and intelligently
  - high data availability
  - low operational cost
Bitcask
• A Bitcask instance is a directory, and only one operating system process will open that Bitcask for writing at a given time.
• At any moment, exactly one file in that directory is the active file, which receives all writes.
• The active file is written only by appending, so writes are sequential and require no disk seeks.
Hash Index: keydir
• A keydir is simply a hash table that maps every key in a Bitcask to a fixed-size structure giving the file, offset, and size of the most recently written entry for that key.
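The append-only active file and the keydir can be illustrated together. The following is a minimal sketch, not Bitcask's actual on-disk format: the class name `TinyBitcask`, the record layout (two 4-byte lengths followed by key and value), and the single data file with `file_id` 0 are all assumptions made for illustration.

```python
import os
import struct

# Assumed record layout for this sketch: key_len (4 bytes), value_len (4 bytes), key, value.
HEADER = struct.Struct(">II")


class TinyBitcask:
    """Toy store: one append-only data file plus an in-memory keydir."""

    def __init__(self, path):
        self.file_id = 0            # a real Bitcask directory holds many files
        self.keydir = {}            # key -> (file_id, value_offset, value_size)
        self.f = open(path, "a+b")  # append mode: every write lands at the end

    def put(self, key: bytes, value: bytes) -> None:
        self.f.seek(0, os.SEEK_END)
        start = self.f.tell()
        self.f.write(HEADER.pack(len(key), len(value)) + key + value)
        self.f.flush()
        # The keydir always points at the most recently written entry for the key.
        value_offset = start + HEADER.size + len(key)
        self.keydir[key] = (self.file_id, value_offset, len(value))

    def get(self, key: bytes):
        entry = self.keydir.get(key)
        if entry is None:
            return None
        _file_id, offset, size = entry
        self.f.seek(offset)         # one seek, one read
        return self.f.read(size)
```

Writing the same key twice leaves both records in the file, but `get` returns only the newer value because the keydir entry was overwritten. The older record becomes garbage that a later merge can reclaim.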
Merge
• The merge process iterates over all non-active files and produces as output a set of data files containing only the "live" (latest) version of each key that is present.
• For each merged data file, the merge process also generates a byproduct called a hint file, which makes startup and crash recovery easier.
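The merge step can be sketched as follows. This is a simplified in-memory model, not Riak's implementation: files are represented as Python lists of `(key, value)` records, a record's position in the list stands in for its byte offset, and deletions are ignored. The function names `merge` and `load_keydir` are hypothetical.

```python
def merge(immutable_files):
    """Merge non-active files, oldest first.

    Each file is a list of (key, value) records in write order.
    Returns (merged_file, hint_file).
    """
    latest = {}
    for records in immutable_files:
        for key, value in records:
            latest[key] = value  # later records overwrite earlier ones

    merged_file = []
    hint_file = []
    for key, value in latest.items():
        # The hint records where the live value sits, without the value itself.
        hint_file.append((key, len(merged_file), len(value)))
        merged_file.append((key, value))
    return merged_file, hint_file


def load_keydir(hint_file, file_id):
    """Rebuild keydir entries from a hint file without reading any values."""
    return {key: (file_id, pos, size) for key, pos, size in hint_file}
```

Because the hint file holds only keys and locations, a restart can rebuild the keydir by scanning the small hint files rather than the full data files.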
RethinkDB?
• RethinkDB is a persistent, industrial-strength key-value store with full support for the Memcached protocol.
• Powerful technology:
  - Linear scaling across cores
  - Fine-grained durability control
  - Instantaneous recovery on power failure
• Supported core features:
  - Atomic increment/decrement
  - Values up to 10 MB in size
  - Multi-GET support
  - Up to one million transactions per second on commodity hardware
Installation & usage
• RethinkDB works on modern 64-bit distributions of Linux:
  - Ubuntu 10.04.1 x86_64
  - Ubuntu 10.10 x86_64
  - Red Hat Enterprise Linux 5 x86_64
  - CentOS 5 x86_64
  - SUSE Linux 10
• Running the rethinkdb server (default installation path: /usr/bin/rethinkdb-1.0):
  ./rethinkdb-1.0 -f /u01/rethinkdb_data
  ./rethinkdb-1.0 -f /u01/rethinkdb_data -c 4 -p 11500
  ./rethinkdb-1.0 -f /u01/rethinkdb_data -f /u03/rethinkdb_data -c 4 -p 11500
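Since the server speaks the Memcached protocol, any Memcached client should be able to talk to it. As a rough sketch, the requests of the Memcached text protocol can be built by hand. The helper names below are hypothetical, and the default port 11500 assumes that the `-p 11500` option shown above sets the listening port.

```python
import socket


def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    # Text protocol: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    return f"set {key} {flags} {exptime} {len(value)}\r\n".encode() + value + b"\r\n"


def build_get(*keys: str) -> bytes:
    # Several keys in one request give a multi-GET.
    return ("get " + " ".join(keys) + "\r\n").encode()


def build_incr(key: str, delta: int) -> bytes:
    # Atomic increment: incr <key> <value>\r\n
    return f"incr {key} {delta}\r\n".encode()


def send_command(command: bytes, host: str = "localhost", port: int = 11500) -> bytes:
    """Send one request and return the first chunk of the reply.

    A single recv() is enough only for short replies; a real client
    would keep reading until the reply is complete.
    """
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall(command)
        return conn.recv(4096)
```

With a server running, `send_command(build_set("greeting", b"hello"))` would be expected to return a `STORED` reply under the standard protocol.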
The methodology
• First, the lack of mechanical parts makes random reads on SSDs very efficient.
• Second, random writes trigger more erases, which makes these operations expensive and shortens the drive's lifetime.
• RethinkDB therefore takes an append-only approach to storing data, pioneered by log-structured file systems.
What are the consequences of append-only?
Append-only consequences
• Benefits:
  - Data Consistency
  - Hot Backups
  - Instantaneous Recovery
  - Easy Replication
  - Lock-Free Concurrency
  - Live Schema Changes
  - Database Snapshots
• Costs:
  1) Eliminating data locality means that reads require a larger number of disk accesses.
  2) A large amount of data quickly becomes obsolete in an environment with a heavy insert or update workload.
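Two of these consequences can be seen in a toy model. This is a minimal sketch of a generic append-only log, not RethinkDB's storage engine; the class `AppendOnlyLog` and its methods are hypothetical. A snapshot costs nothing because old records are never overwritten, and the same property is why obsolete records pile up under updates.

```python
class AppendOnlyLog:
    """Toy append-only store: records are added, never modified in place."""

    def __init__(self):
        self.records = []  # list of (key, value) in write order

    def put(self, key, value):
        self.records.append((key, value))

    def snapshot(self):
        # A snapshot is just a position in the log.
        return len(self.records)

    def get(self, key, snapshot=None):
        end = len(self.records) if snapshot is None else snapshot
        # Scan backwards so the newest visible version wins.
        for k, v in reversed(self.records[:end]):
            if k == key:
                return v
        return None

    def obsolete_fraction(self):
        """Share of records that are superseded by a newer version."""
        if not self.records:
            return 0.0
        live = len({k for k, _ in self.records})
        return 1 - live / len(self.records)
```

After three writes to the same key, a snapshot taken after the first write still reads the first value, while two thirds of the log is obsolete and waiting for garbage collection.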
LevelDB?
• LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
• Supported core features:
  - Data is stored sorted by key
  - Multiple changes can be made in one atomic batch
  - Users can create a transient snapshot to get a consistent view of data
  - Data is automatically compressed using the Snappy compression library
Installation & usage
• LevelDB works with snappy, a compression/decompression library. Download snappy from http://code.google.com/p/snappy/ and build it:
  cd snappy-1.0.4
  ./configure && make && make install
• LevelDB is a library; there is no database server!
  svn checkout http://leveldb.googlecode.com/svn/trunk/ leveldb-read-only
  cd leveldb-read-only
  make && cp libleveldb.a /usr/local/lib && cp -r include/leveldb /usr/local/include
Log-structured merge tree
• Log file: a log file (*.log) stores a sequence of recent updates; each update is appended to the current log file.
• Memtable: an in-memory structure that keeps a copy of the current log file.
• Sorted tables: a sorted table (*.sst) stores a sequence of entries sorted by key; each entry is either a value for the key or a deletion marker for the key.
[Figure: read and write paths; the Memtable is in memory, while the log file and the SSTables (Level-0, Level-1, …) are on disk]
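How the three pieces cooperate can be sketched in a few lines. This is an in-memory toy, not LevelDB's implementation: the class `TinyLSM`, the `memtable_limit` parameter, and the `TOMBSTONE` marker are hypothetical names, the "log" and "sorted tables" are Python lists rather than files, and compaction between levels is left out.

```python
import bisect

TOMBSTONE = object()  # deletion marker (name assumed for this sketch)


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.log = []        # stands in for the *.log file
        self.memtable = {}   # in-memory copy of the current log
        self.sstables = []   # newest first; each is (sorted_keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))  # 1) append to the log
        self.memtable[key] = value     # 2) update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)       # a delete is just another write

    def _flush(self):
        # Write the memtable out as a table sorted by key, then start fresh.
        items = sorted(self.memtable.items(), key=lambda kv: kv[0])
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.sstables.insert(0, (keys, values))
        self.memtable = {}
        self.log = []

    def get(self, key):
        # Read path: memtable first, then sorted tables from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for keys, values in self.sstables:
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is TOMBSTONE else values[i]
        return None
```

A read may have to consult several sorted tables before it finds the key, which is the read-side cost of the sequential write path.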
Conclusion
• A log structure delivers high write throughput and makes data consistency, hot backups, recovery, and snapshots easy.
• A log structure eliminates data locality, so queries require a larger number of random disk accesses.
• A good garbage collection method is therefore very important to a log-structured storage system.