SlideShare a Scribd company logo
1 of 28
Download to read offline
Vinyl:
why we wrote our own
write-optimized storage
engine rather than chose
RocksDB
kostja@tarantool.org
Konstantin Osipov
Plan
● A review of log structured merge tree data structure
● Vinyl engine in Tarantool
● Configuration parameters
● Key use cases
Why build, rather than buy: rocksdb, forestdb, ...
Transaction Control
- read your own writes
- multi-engine
- undo log
Engine 1
indexing
checkpoint
storage mgmt
Write ahead log
- redo log
- async relay
- group replication
I/O
- session
- parser
Engine 2
…
…
...
The shape of a classical LSM
DELETE in an LSM tree
● We can’t delete from append-only files
● Tombstones (delete markers) are inserted into L0 instead
26
Disk
3 8 15 26 35 40 45 48
10 25 36 42
22 37
Memory
DELETE in an LSM tree
Disk
3 8 15 26 35 40 45 48
10 25 36 42
22 26
Memory
37
DELETE in an LSM tree
Disk
3 8 10 15 22 35 36 37
Memory
40 42 45 4825
SELECT in LSM tree
● Search in all levels, until the key is found
● Optimistic for point queries
● Merge sort in case of range select
SELECT in LSM tree
Disk
2 6 7 10 14 16 23 28 30 32 37 38 41 45 47 49
3 8 15 26 35 40 45 48
10 25 36 42
22 37
GET(16) Memory
LSM: the algorithmic limit
LSM tree B-tree
Search K * O(log2
N/B) O(logB
N)
Delete O(log2
(N)/B) O(logB
N)
Insert O(log2
(N)/B) O(logB
N)
RUM conjecture
The ubiquitous fight between
the Read, the Update, and the
Memory overhead of access
methods for modern data
systems
Key LSM challenges in Web/OLTP
● Slow reads: read amplification
● Potentially write the same data multiple times: write
amplification
● Keep garbage around: space amplification
● Response times affected by background activity: latency spikes
Vinyl: memtable, sorted run, dump & compact
● Stores statements, not values:
● REPLACE,
● DELETE,
● UPSERT
● Every statement is marked by LSN
● Append-only files, garbage collected after checkpoint
● Transactional log of all filesystem changes: vylog
key lsn op_code value
Memtable, sorted run, dump & compact
Disk
3 8 15 26 36 40 45
10 25 36 40
Memory
memtable
22 37
dump
sorted run
compact
3 8 10 15 25 26 36 40 45
Vinyl: read [25, 30)
3 8 15 26
1 25 36
15 26 29
4
25 26 29merge
Uncommitted changes
Tuple cache
L0
Sorted run 83
Keeping the latency within limits
● anticipatory dump
● throttling
Disk
Memory
memtable
for writes
dump
3 8 10 15 25 26 36 40 45
22 37
22 37
shadow
Reducing read amplification
● page index
● bloom filters
● tuple range cache
● multi-level compaction
Disk22 37
3 8 10 15 25 26 36 40 45
3 15 22 36 22 25 26
page index tuple cachewrites
001010101010100110
110101011100000000
Bloom filter
Reducing write amplification
● Multi-level compaction can span any number of levels
● A level can contain multiple runs
Disk65 92
2 19 26 29 32 38 43 46 79
writes
14 15 35 89
22 23 27 31 37 46 65 78 94
3 9 28 33 44 47 48 54 56 59 61 66 75 81 93 94 96 98 99
Reducing space amplification
● Ranges reflect a static layout of sorted runs
● Slices connect a sorted run into a range
Disk65 92
2 3 9 14 15 22
14 15 35 89
23 27 28 31 33 37 44 81 93 94 96 98
Memory
[-inf, 23) [-23, 45) [-45, +inf)
Secondary key support
● L0 and tuple cache use memtx technology
● This makes it possible to include L0 and tuple cache in
checkpoint, so no cold restart
● L1+ stores the value of primary key as tuple id
● Unique secondary keys rely on bloom filter heavily for
INSERT, REPLACE
● REPLACE in a non-unique secondary is a blind write, garbage
is pruned at compaction
The scheduler
● The only active entity inside vinyl
engine
● Maintains two queues:
compact and dump
● Each range is in each queue,
parallel compaction
● Dump always trumps compact
● Chunked garbage collection for
quick memory reclamation
● Is a fiber, so needs no locks
Transactions
● MVCC
● the first transaction to commit wins
● no waits, no deadlocks, only aborts
● yields don’t abort transactions
Replication & backup
Work out of the box
Limitations
● Tuple size is up to 16M
● You need ~5k file descriptors per 1TB of data
● Optimistic transaction control is bad for long read-write
transactions
● No cross-engine transactions yet
Configuration
Name Description Default
bloom_fpr Bloom filter max false positive rate 5%
page_size Approximate page size 8192
range_size Approximate range size 1G
run_count_per_level How many sorted runs are allowed on a
single level
2
run_size_ratio Run size ratio between levels 3.5
memory L0 total size 0.25G
cache Tuple cache size 0.25G
threads The number of compaction threads 2
UPSERT
● Non-reading update or insert
● Delayed execution
● Background upsert squashing prevents upserts from piling up
space:upsert(tuple, {{operator, field, value}, ...})
@kostja_osipov fb.com/TarantoolDatabase
www.tarantool.orgkostja@tarantool.org
Thank you! Questions?
Links
Leveled compaction in Apache Cassandra
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
http://www.scylladb.com/kb/compaction/
https://github.com/facebook/rocksdb/wiki/Universal-Compaction
https://dom.as/2015/04/09/how-innodb-lost-its-advantage/

More Related Content

Similar to vinyl

Raft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdfRaft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdffengxun
 
When is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar SeriesWhen is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar SeriesAlkin Tezuysal
 
Mirko Damiani - An Embedded soft real time distributed system in Go
Mirko Damiani - An Embedded soft real time distributed system in GoMirko Damiani - An Embedded soft real time distributed system in Go
Mirko Damiani - An Embedded soft real time distributed system in Golinuxlab_conf
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale:  an Intelligent Database ProxyMariaDB MaxScale:  an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMarkus Mäkelä
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixDatabricks
 
M|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write PathsM|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write PathsMariaDB plc
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberYing Zheng
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMarkus Mäkelä
 
PL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptxPL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptxVinicius M Grippa
 
Gunjae_ISCA15_slides.pdf
Gunjae_ISCA15_slides.pdfGunjae_ISCA15_slides.pdf
Gunjae_ISCA15_slides.pdfssuser30e7d2
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDBPingCAP
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterRed_Hat_Storage
 
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous AvailabilityRamp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous AvailabilityPythian
 
MariaDB / MySQL tripping hazard and how to get out again?
MariaDB / MySQL tripping hazard and how to get out again?MariaDB / MySQL tripping hazard and how to get out again?
MariaDB / MySQL tripping hazard and how to get out again?FromDual GmbH
 
Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)
Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)
Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)data://disrupted®
 
Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopTamas K Lengyel
 
MySQL configuration - The most important Variables
MySQL configuration - The most important VariablesMySQL configuration - The most important Variables
MySQL configuration - The most important VariablesFromDual GmbH
 

Similar to vinyl (20)

Raft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdfRaft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdf
 
When is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar SeriesWhen is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar Series
 
Mirko Damiani - An Embedded soft real time distributed system in Go
Mirko Damiani - An Embedded soft real time distributed system in GoMirko Damiani - An Embedded soft real time distributed system in Go
Mirko Damiani - An Embedded soft real time distributed system in Go
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale:  an Intelligent Database ProxyMariaDB MaxScale:  an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database Proxy
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
 
M|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write PathsM|18 Deep Dive: InnoDB Transactions and Write Paths
M|18 Deep Dive: InnoDB Transactions and Write Paths
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at Uber
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database Proxy
 
PL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptxPL22 - Backup and Restore Performance.pptx
PL22 - Backup and Restore Performance.pptx
 
Gunjae_ISCA15_slides.pdf
Gunjae_ISCA15_slides.pdfGunjae_ISCA15_slides.pdf
Gunjae_ISCA15_slides.pdf
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDB
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on gluster
 
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous AvailabilityRamp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
 
MariaDB / MySQL tripping hazard and how to get out again?
MariaDB / MySQL tripping hazard and how to get out again?MariaDB / MySQL tripping hazard and how to get out again?
MariaDB / MySQL tripping hazard and how to get out again?
 
Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)
Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)
Operation Unthinkable – Software Defined Storage @ Booking.com (Peter Buschman)
 
Linux logging
Linux loggingLinux logging
Linux logging
 
MySQL DBA
MySQL DBAMySQL DBA
MySQL DBA
 
Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshop
 
MySQL configuration - The most important Variables
MySQL configuration - The most important VariablesMySQL configuration - The most important Variables
MySQL configuration - The most important Variables
 

Recently uploaded

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Recently uploaded (20)

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

vinyl

  • 1. Vinyl: why we wrote our own write-optimized storage engine rather than chose RocksDB kostja@tarantool.org Konstantin Osipov
  • 2. Plan ● A review of log structured merge tree data structure ● Vinyl engine in Tarantool ● Configuration parameters ● Key use cases
  • 3. Why build, rather than buy: rocksdb, forestdb, ... Transaction Control - read your own writes - multi-engine - undo log Engine 1 indexing checkpoint storage mgmt Write ahead log - redo log - async relay - group replication I/O - session - parser Engine 2 … … ...
  • 4. The shape of a classical LSM
  • 5. DELETE in an LSM tree ● We can’t delete from append-only files ● Tombstones (delete markers) are inserted into L0 instead 26 Disk 3 8 15 26 35 40 45 48 10 25 36 42 22 37 Memory
  • 6. DELETE in an LSM tree Disk 3 8 15 26 35 40 45 48 10 25 36 42 22 26 Memory 37
  • 7. DELETE in an LSM tree Disk 3 8 10 15 22 35 36 37 Memory 40 42 45 4825
  • 8. SELECT in LSM tree ● Search in all levels, until the key is found ● Optimistic for point queries ● Merge sort in case of range select
  • 9. SELECT in LSM tree Disk 2 6 7 10 14 16 23 28 30 32 37 38 41 45 47 49 3 8 15 26 35 40 45 48 10 25 36 42 22 37 GET(16) Memory
  • 10. LSM: the algorithmic limit LSM tree B-tree Search K * O(log2 N/B) O(logB N) Delete O(log2 (N)/B) O(logB N) Insert O(log2 (N)/B) O(logB N)
  • 11. RUM conjecture The ubiquitous fight between the Read, the Update, and the Memory overhead of access methods for modern data systems
  • 12. Key LSM challenges in Web/OLTP ● Slow reads: read amplification ● Potentially write the same data multiple times: write amplification ● Keep garbage around: space amplification ● Response times affected by background activity: latency spikes
  • 13. Vinyl: memtable, sorted run, dump & compact ● Stores statements, not values: ● REPLACE, ● DELETE, ● UPSERT ● Every statement is marked by LSN ● Append-only files, garbage collected after checkpoint ● Transactional log of all filesystem changes: vylog key lsn op_code value
  • 14. Memtable, sorted run, dump & compact Disk 3 8 15 26 36 40 45 10 25 36 40 Memory memtable 22 37 dump sorted run compact 3 8 10 15 25 26 36 40 45
  • 15. Vinyl: read [25, 30) 3 8 15 26 1 25 36 15 26 29 4 25 26 29merge Uncommitted changes Tuple cache L0 Sorted run 83
  • 16. Keeping the latency within limits ● anticipatory dump ● throttling Disk Memory memtable for writes dump 3 8 10 15 25 26 36 40 45 22 37 22 37 shadow
  • 17. Reducing read amplification ● page index ● bloom filters ● tuple range cache ● multi-level compaction Disk22 37 3 8 10 15 25 26 36 40 45 3 15 22 36 22 25 26 page index tuple cachewrites 001010101010100110 110101011100000000 Bloom filter
  • 18. Reducing write amplification ● Multi-level compaction can span any number of levels ● A level can contain multiple runs Disk65 92 2 19 26 29 32 38 43 46 79 writes 14 15 35 89 22 23 27 31 37 46 65 78 94 3 9 28 33 44 47 48 54 56 59 61 66 75 81 93 94 96 98 99
  • 19. Reducing space amplification ● Ranges reflect a static layout of sorted runs ● Slices connect a sorted run into a range Disk65 92 2 3 9 14 15 22 14 15 35 89 23 27 28 31 33 37 44 81 93 94 96 98 Memory [-inf, 23) [-23, 45) [-45, +inf)
  • 20. Secondary key support ● L0 and tuple cache use memtx technology ● This makes it possible to include L0 and tuple cache in checkpoint, so no cold restart ● L1+ stores the value of primary key as tuple id ● Unique secondary keys rely on bloom filter heavily for INSERT, REPLACE ● REPLACE in a non-unique secondary is a blind write, garbage is pruned at compaction
  • 21. The scheduler ● The only active entity inside vinyl engine ● Maintains two queues: compact and dump ● Each range is in each queue, parallel compaction ● Dump always trumps compact ● Chunked garbage collection for quick memory reclamation ● Is a fiber, so needs no locks
  • 22. Transactions ● MVCC ● the first transaction to commit wins ● no waits, no deadlocks, only aborts ● yields don’t abort transactions
  • 23. Replication & backup Work out of the box
  • 24. Limitations ● Tuple size is up to 16M ● You need ~5k file descriptors per 1TB of data ● Optimistic transaction control is bad for long read-write transactions ● No cross-engine transactions yet
  • 25. Configuration Name Description Default bloom_fpr Bloom filter max false positive rate 5% page_size Approximate page size 8192 range_size Approximate range size 1G run_count_per_level How many sorted runs are allowed on a single level 2 run_size_ratio Run size ratio between levels 3.5 memory L0 total size 0.25G cache Tuple cache size 0.25G threads The number of compaction threads 2
  • 26. UPSERT ● Non-reading update or insert ● Delayed execution ● Background upsert squashing prevents upserts from piling up space:upsert(tuple, {{operator, field, value}, ...})
  • 28. Links Leveled compaction in Apache Cassandra http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra http://www.scylladb.com/kb/compaction/ https://github.com/facebook/rocksdb/wiki/Universal-Compaction https://dom.as/2015/04/09/how-innodb-lost-its-advantage/