Bitcask
A Log-Structured Hash Table for Fast Key/Value Data

Justin Sheehy and David Smith
with inspiration from Eric Brewer
Basho Technologies
justin@basho.com, dizzyd@basho.com




   The origin of Bitcask is tied to the history of the Riak distributed database. In a Riak key/value cluster, each
node uses pluggable local storage; nearly anything k/v-shaped can be used as the per-host storage engine. This
pluggability allowed progress on Riak to be parallelized such that storage engines could be improved and tested
without impact on the rest of the codebase.
   Many such local key/value stores already exist, including but not limited to Berkeley DB, Tokyo Cabinet,
and Innostore. There are many goals we sought when evaluating such storage engines, including:

 • low latency per item read or written
 • high throughput, especially when writing an incoming stream of random items
 • ability to handle datasets much larger than RAM w/o degradation
 • crash friendliness, both in terms of fast recovery and not losing data
 • ease of backup and restore
 • a relatively simple, understandable (and thus supportable) code structure and data format
 • predictable behavior under heavy access load or large volume
 • a license that allowed for easy default use in Riak

   Achieving some of these is easy. Achieving them all is less so.

   None of the local key/value storage systems available (including but not limited to those written by the
authors) were ideal with regard to all of the above goals. We were discussing this issue with Eric Brewer when
he had a key insight about hash table log merging: that doing so could potentially be made as fast as or faster than
LSM-trees.
   This led us to explore some of the techniques used in the log-structured file systems first developed in the
1980s and 1990s in a new light. That exploration led to the development of Bitcask, a storage system that meets
all of the above goals very well. While Bitcask was originally developed with a goal of being used under Riak,
it was built to be generic and can serve as a local key/value store for other applications as well.




Copyright 2010, Basho Technologies


The model we ended up going with is conceptually very simple. A Bitcask instance is a directory, and we
enforce that only one operating system process will open that Bitcask for writing at a given time. You can think
of that process effectively as the "database server". At any moment, one file is "active" in that directory for
writing by the server. When that file meets a size threshold it will be closed and a new active file will be created.
Once a file is closed, either purposefully or due to server exit, it is considered immutable and will never be
opened for writing again.
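   As a rough sketch of that rotation policy (in Erlang, which is what Bitcask is written against), the writer can
check the active file against the size threshold before each append. The record, function, and file names below are
illustrative assumptions, not the actual Bitcask internals:

   %% Illustrative sketch only: close the active file once it reaches the
   %% size threshold and open a fresh one. Names are hypothetical.
   -record(wstate, {dirname, active_fd, active_id, active_size, threshold}).

   maybe_rotate(S = #wstate{active_size = Size, threshold = Max}) when Size < Max ->
       S;
   maybe_rotate(S = #wstate{dirname = Dir, active_fd = Fd, active_id = Id}) ->
       ok = file:close(Fd),                                   % a closed file is now immutable
       NewId = Id + 1,
       Name = filename:join(Dir, integer_to_list(NewId) ++ ".bitcask.data"),
       {ok, NewFd} = file:open(Name, [append, raw, binary]),  % append-only, never rewritten
       S#wstate{active_fd = NewFd, active_id = NewId, active_size = 0}.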




  The active file is only written by appending, which means that sequential writes do not require disk seeking.
The format that is written for each key/value entry is simple:
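   The figure showing that format is not reproduced here. As a minimal sketch, assuming a CRC-and-timestamp header
with 32-bit size fields (the exact field widths and header layout are assumptions, not taken from this document),
one entry can be built like this:

   %% Sketch of one on-disk entry: a CRC over the record, a timestamp, the
   %% key and value sizes, then the key and value bytes themselves.
   %% The 32-bit widths are assumptions made for illustration.
   make_entry(Key, Value) when is_binary(Key), is_binary(Value) ->
       Tstamp  = erlang:system_time(second),
       KeySz   = byte_size(Key),
       ValueSz = byte_size(Value),
       Body    = <<Tstamp:32, KeySz:32, ValueSz:32, Key/binary, Value/binary>>,
       CRC     = erlang:crc32(Body),
       <<CRC:32, Body/binary>>.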




   With each write, a new entry is appended to the active file. Note that deletion is simply a write of a special
tombstone value, which will be removed on the next merge. Thus, a Bitcask data file is nothing more than a
linear sequence of these entries:
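   (The figure of such a sequence is not reproduced here.) Since deletion is just another append, it can be
expressed directly in terms of the put operation documented at the end of this paper; the sentinel binary used for
the tombstone below is an assumption made for illustration, not the real constant:

   %% Sketch: deleting a key is storing a reserved tombstone value; the
   %% next merge drops both the tombstone and the stale entries it hides.
   -define(TOMBSTONE, <<"bitcask_tombstone">>).

   delete(Handle, Key) ->
       bitcask:put(Handle, Key, ?TOMBSTONE).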




After the append completes, an in-memory structure called a "keydir" is updated. A keydir is simply a hash
table that maps every key in a Bitcask to a fixed-size structure giving the file, offset, and size of the most recently
written entry for that key.
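   The keydir diagram is likewise omitted here. A minimal sketch of one keydir entry follows, using the fields named
in this description (file id, position, and size) plus a timestamp that we assume is kept for comparing entry
recency; the timestamp field and the use of a plain Erlang map as the hash table are illustrative assumptions:

   %% Sketch of the keydir: a hash from each key to a small fixed-size
   %% record locating that key's most recently written entry.
   -record(keydir_entry, {file_id, value_sz, value_pos, tstamp}).

   keydir_put(KeyDir, Key, FileId, ValueSz, ValuePos, Tstamp) ->
       maps:put(Key, #keydir_entry{file_id   = FileId,
                                   value_sz  = ValueSz,
                                   value_pos = ValuePos,
                                   tstamp    = Tstamp},
                KeyDir).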




    When a write occurs, the keydir is atomically updated with the location of the newest data. The old data is
still present on disk, but any new reads will use the latest version available in the keydir. As we’ll see later, the
merge process will eventually remove the old value.
    Reading a value is simple, and doesn’t ever require more than a single disk seek. We look up the key in our
keydir, and from there we read the data using the file id, position, and size that are returned from that lookup. In
many cases, the operating system’s filesystem read-ahead cache makes this a much faster operation than would
be otherwise expected.
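   A minimal sketch of that read path, assuming the keydir entry record above and a hypothetical fd_for_file_id/1
helper that returns an open file descriptor for a given file id:

   %% Sketch of a read: one keydir lookup, then a single positional read of
   %% exactly value_sz bytes at value_pos in the indicated data file.
   lookup(KeyDir, Key) ->
       case maps:find(Key, KeyDir) of
           error ->
               not_found;
           {ok, #keydir_entry{file_id = FileId, value_pos = Pos, value_sz = Sz}} ->
               Fd = fd_for_file_id(FileId),    % hypothetical helper
               file:pread(Fd, Pos, Sz)         % {ok, ValueBinary} on success
       end.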




This simple model may use up a lot of space over time, since we just write out new values without touching
the old ones. A process for compaction that we refer to as "merging" solves this. The merge process iterates
over all non-active (i.e. immutable) files in a Bitcask and produces as output a set of data files containing only
the "live" or latest versions of each present key.
   When this is done we also create a "hint file" next to each data file. These are essentially like the data files
but instead of the values they contain the position and size of the values within the corresponding data file.
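   A hint-file entry can be sketched in the same style as the data-file entry above: the same metadata plus the
value's position, but no value bytes. The exact layout and field widths are again assumptions made for illustration:

   %% Sketch of one hint-file entry: timestamp, key size, value size, the
   %% value's position in the corresponding data file, and the key itself.
   make_hint_entry(Key, ValueSz, ValuePos, Tstamp) when is_binary(Key) ->
       KeySz = byte_size(Key),
       <<Tstamp:32, KeySz:32, ValueSz:32, ValuePos:64, Key/binary>>.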




   When a Bitcask is opened by an Erlang process, it checks to see if there is already another Erlang process in
the same VM that is using that Bitcask. If so, it will share the keydir with that process. If not, it scans all of the
data files in a directory in order to build a new keydir. For any data file that has a hint file, that will be scanned
instead for a much quicker startup time.
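   A sketch of that startup scan follows; hint_file_for/1, read_hint_entries/1, read_data_entries/1, and
add_entry_to_keydir/2 are hypothetical helpers assumed to enumerate and apply the entries in the respective files:

   %% Sketch of keydir reconstruction at open time: prefer a hint file when
   %% one exists, otherwise scan the full data file.
   build_keydir(DataFiles) ->
       lists:foldl(
         fun(DataFile, KeyDir) ->
                 HintFile = hint_file_for(DataFile),
                 Entries = case filelib:is_regular(HintFile) of
                               true  -> read_hint_entries(HintFile);
                               false -> read_data_entries(DataFile)
                           end,
                 lists:foldl(fun add_entry_to_keydir/2, KeyDir, Entries)
         end,
         #{}, DataFiles).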
   These basic operations are the essence of the Bitcask system. Obviously, we’ve not tried to expose every detail
of operations in this document; our goal here is to help you understand the general mechanisms of Bitcask. Some
additional notes on a couple of areas we breezed past are probably in order:

 • We mentioned that we rely on the operating system’s filesystem cache for read performance. We have
   discussed adding a bitcask-internal read cache as well, but given how much mileage we get for free right
   now it’s unclear how much that will pay off.
 • We will present benchmarks against various API-similar local storage systems very soon. However, our
   initial goal with Bitcask was not to be the fastest storage engine but rather to get "enough" speed and also
   high quality and simplicity of code, design, and file format. That said, in our initial simple benchmarking we
   have seen Bitcask handily outperform other fast storage systems for many scenarios.
 • Some of the hardest implementation details are also the least interesting to most outsiders, so we haven’t
   included in this short document a description of (e.g.) the internal keydir locking scheme.
 • Bitcask does not perform any compression of data, as the cost/benefit of doing so is very application-
   dependent.



And let’s look at the goals we had when we set out:




• low latency per item read or written

  Bitcask is fast. We plan on doing more thorough benchmarks soon, but with sub-millisecond typical median
  latency (and quite adequate higher percentiles) in our early tests we are confident that it can be made to meet
  our speed goals.

• high throughput, especially when writing an incoming stream of random items

  In early tests on a laptop with slow disks, we have seen throughput of 5000-6000 writes per second.

• ability to handle datasets much larger than RAM w/o degradation

  The tests mentioned above used a dataset of more than 10×RAM on the system in question, and showed no
  sign of changed behavior at that point. This is consistent with our expectations given the design of Bitcask.

• crash friendliness, both in terms of fast recovery and not losing data

  As the data files and the commit log are the same thing in Bitcask, recovery is trivial with no need for
  "replay." The hint files can be used to make the startup process speedy.

• ease of backup and restore

  Since the files are immutable after rotation, backup can use whatever system-level mechanism is preferred by
  the operator with ease. Restoration requires nothing more than placing the data files in the desired directory.

• a relatively simple, understandable (and thus supportable) code structure and data format

  Bitcask is conceptually simple, the code is clean, and the data files are very easy to understand and manage. We
  feel very comfortable supporting a system resting atop Bitcask.

• predictable behavior under heavy access load or large volume

  Under heavy access load we’ve already seen Bitcask do well. So far it has only seen double-digit gigabyte
  volumes, but we’ll be testing it with more soon. The shape of Bitcask is such that we do not expect it to
  perform too differently under larger volume, with the one predictable exception that the keydir structure
  grows by a small amount with the number of keys and must fit entirely in RAM. This limitation is minor in
  practice, as even with many millions of keys the current implementation uses well under a GB of memory.




 In summary, given this specific set of goals, Bitcask suits our needs better than anything else we had available.

The API is quite simple:
 bitcask:open(DirectoryName, Opts)          Open a new or existing Bitcask datastore with additional options.
 → BitCaskHandle | {error, any()}           Valid options include read_write (if this process is going to be a
                                            writer and not just a reader) and sync_on_put (if this writer would
                                            prefer to sync the write file after every write operation).
                                            The directory must be readable and writable by this process, and
                                            only one process may open a Bitcask with read_write at a time.
 bitcask:open(DirectoryName)                Open a new or existing Bitcask datastore for read-only access.
 → BitCaskHandle | {error, any()}           The directory and all files in it must be readable by this process.
 bitcask:get(BitCaskHandle, Key)            Retrieve a value by key from a Bitcask datastore.
 → not_found | {ok, Value}
 bitcask:put(BitCaskHandle, Key, Value) Store a key and value in a Bitcask datastore.
 → ok | {error, any()}
 bitcask:delete(BitCaskHandle, Key)          Delete a key from a Bitcask datastore.
 → ok | {error, any()}
 bitcask:list_keys(BitCaskHandle)            List all keys in a Bitcask datastore.
 → [Key] | {error, any()}
 bitcask:fold(BitCaskHandle,Fun,Acc0)        Fold over all K/V pairs in a Bitcask datastore.
 → Acc                                       Fun is expected to be of the form: F(K,V,Acc0) → Acc.
 bitcask:merge(DirectoryName)                Merge several data files within a Bitcask datastore into a more
 → ok | {error, any()}                       compact form. Also, produce hintfiles for faster startup.
 bitcask:sync(BitCaskHandle)                 Force any writes to sync to disk.
 → ok
 bitcask:close(BitCaskHandle)                Close a Bitcask data store and flush all pending writes
 → ok                                        (if any) to disk.
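   For example, a minimal write/read/delete session using only the calls documented above (the directory path is an
arbitrary example, and this is an illustrative sketch rather than tested output):

   %% Illustrative use of the documented API: open for writing, store a
   %% value, read it back, delete it, and close.
   demo() ->
       Handle = bitcask:open("/tmp/bitcask_demo", [read_write]),
       ok = bitcask:put(Handle, <<"k1">>, <<"v1">>),
       {ok, <<"v1">>} = bitcask:get(Handle, <<"k1">>),
       ok = bitcask:delete(Handle, <<"k1">>),
       not_found = bitcask:get(Handle, <<"k1">>),
       ok = bitcask:close(Handle).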
As you can see, Bitcask is quite simple to use. Enjoy!



