
# 2011.06.20 stratified-btree

Published on Jul 07, 2011
• LolCoW: if you want to do fast updates, then the CoW technique cannot help; CoW is built around the assumption that every update can do a lookup and update reference counts.
• The crucial notion is density. A versioned array, a version tree and its layout on disk. Versions v1, v2, v3 are tagged, so dark entries are lead entries. The entry (k0, v0, x) is written in v0, so it is not a lead entry, but it is live at v1, v2 and v3. Similarly, (k1, v0, x) is live at v1 and v3 but not at v2 (since it was overwritten at v2). The live counts are as follows: live(v1) = 4, live(v2) = 4, live(v3) = 4, so the density is 4/8. In practice, the on-disk layout can be compressed by writing the key once for all the versions, and by other well-known techniques.
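The liveness and density computation in the note above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the version tree (v0 is the parent of v1, which has children v2 and v3) is inferred from the liveness facts stated in the note, and all names are illustrative.

```python
# Assumed version tree, inferred from the example: v0 -> v1 -> {v2, v3}.
PARENT = {"v1": "v0", "v2": "v1", "v3": "v1"}

def path(v):
    """Versions from the root down to v, inclusive."""
    out = [v]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return list(reversed(out))

# (key, version written) pairs for the array in the example.
A = [("k0", "v0"), ("k1", "v0"), ("k1", "v2"), ("k2", "v1"),
     ("k2", "v2"), ("k2", "v3"), ("k3", "v1"), ("k3", "v2")]

def live(entry, v):
    """An entry (k, w) is live at v if w lies on the root-to-v path and no
    entry for k is written at a strictly later version on that path."""
    k, w = entry
    p = path(v)
    if w not in p:
        return False
    return not any(k2 == k and w2 in p and p.index(w2) > p.index(w)
                   for (k2, w2) in A)

def density(W):
    """density(A, W) = min over w in W of (#entries live at w) / |A|."""
    return min(sum(live(e, w) for e in A) for w in W) / len(A)
```

Running `density(["v1", "v2", "v3"])` reproduces the 4/8 figure from the note, with each live count equal to 4.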
• Example of density amplification. The merged array has density 2/11 < 1/5, so it is not dense. We find a split into two parts: the first split (A1, {v0, v5}) has size 4 and density 1/2; the second split (A2, {v4, v1, v2, v3}) has size 7 and density 2/7. Both splits have size < 2^(l+1). If no such set exists at v, we recurse into the child vi maximizing |split(A', W'[vi])|; it is possible to show that this always finds a dense split. Once such a set U is identified, the corresponding array is written out, and we recurse on the remainder split(A', W' \ U). Figure \ref{fig:split} in the paper gives an example of density amplification.

• The plot shows range query performance (elements/s extracted using range queries of size 1000). The CoW B-tree is limited by random IO here ((100/s x 32KB) / (200 bytes/key) = 16384 keys/s), but the Stratified B-tree is CPU-bound (the OCaml implementation is single-threaded). Preliminary performance results from a highly concurrent in-kernel implementation suggest that well over 500k updates/s are possible with 16 cores.

## Presentation Transcript

• Big problems, massive data: Stratified B-trees.

• Versioned dictionaries: put(k, ver, data); get(k_start, k_end, ver); clone(v) creates a child of v that inherits the latest version of its keys (e.g. Monday 12:00 is v10, Monday 16:00 is v11, now is v12). This talk: a versioned dictionary with fast updates, and optimal space/query/update tradeoffs.

• Why? It is powerful: cloning, time-travel, cache and space-efficiency, ... Give developers a recent branch of the live dataset; expose different views of the same base dataset; run analytics/tests/etc. on a clone (v13) without performance impact.

• State of the art: copy-on-write, used in ZFS, WAFL, Btrfs, ... Apply path-copying [DSST] to the B-tree. Problems: space blowup (each update may rewrite an entire path); slow updates (for the same reason); needs random IO to scale; concurrency is tricky. A log file system makes updates sequential, but relies on garbage collection (its Achilles' heel!).

• CoW B-tree costs [ZFS, WAFL, Btrfs, ...]: updates take O(log_B Nv) random IOs, roughly log(2^30)/log(10000) ~ 3 IOs/update; a range query of size Z takes O(Z/B) random IOs; space is O(N B log_B Nv). Here Nv = #keys live (accessible) at version v, and B = "block size", say 1MB at 100 bytes/entry = 10,000 entries. Complication: B is asymmetric for flash.

• This talk: updates take O((log Nv)/B) cache-oblivious IOs, roughly log(2^30)/10000 = 0.003 IOs/update (important for flash); a range query of size Z takes O(Z/B) sequential IOs; space is O(N).

• Unversioned case [Doubling Array]. Inserts: buffer arrays in memory until we have more than B elements; e.g. insert 2, 9, 11, 8, merging equal-sized sorted arrays until we get 2 8 9 11, and so on. Similar to log-structured merge trees (LSM) and the cache-oblivious lookahead array (COLA). There are O(log N) "levels", and each element is rewritten once per level, giving O((log N)/B) IOs per insert.

• Doubling array queries: add an index to each array to do lookups; query(k) searches each array independently. Bloom filters can help exclude arrays from a point search, but don't help with range queries.

• Fractional cascading: use information from the search at level l to help the search at level l+1. From each array, sample every 4th element and put a pointer to it in the previous level; these 'forward pointers' give bounds for the search in the next array. In case you get unlucky with the sampling, also add regular 'secondary' pointers to the nearest forward pointer above and below.

• Versioned case (sketch). Adding versions: if the layout k1 ... k13 is good for version 1, it is bad for version 2 (which overwrites k6); and if you try to keep all versions of a key close together, the layout becomes bad for all versions 2, 3, 4, ...

• Density. Arrays are tagged with a version set W; in the example, W = {v1, v2, v3} and the array holds (k0,v0,x) (k1,v0,x) (k1,v2,x) (k2,v1,x) (k2,v2,x) (k2,v3,x) (k3,v1,x) (k3,v2,x). Define f(A, v) = (#elements in A live at version v) / |A|, and density(A, W) = min over w in W of f(A, w). Here live(v1) = live(v2) = live(v3) = 4, so the density is 4/8. We say the array (A, W) is dense if its density is >= 1/5. Tradeoff: high density means good range queries, but many duplicates (imagine density 1 versus density 1/N).

• Range queries. Theorem 2: a range query at version v costs O(log Nv + Z/B) amortized IOs. For large ('voluminous') range queries this follows directly from density. For point queries, amortize the cost l(k, v) of lookup(k, v) over all keys live at v, for a fixed version v: each query examines disjoint regions of the arrays, and density implies the total size examined is O(Nv log Nv), so the amortized cost sum_k l(k, v) / Nv is O(log Nv). For much smaller range queries, the worst-case performance may be the same as for a point query; the amortized bound still applies.

• Don't worry, stay dense!
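The unversioned doubling array described earlier in the transcript can be sketched as follows. This is a minimal in-memory illustration of the merge pattern only (one sorted array per level, merged upward as in COLA/LSM); it omits the on-disk layout, per-array indexes and fractional cascading, and all names are illustrative.

```python
from heapq import merge

class DoublingArray:
    """Level l holds at most one sorted array of size 2^l; inserting
    merges equal-sized arrays upward, COLA/LSM style."""

    def __init__(self):
        self.levels = []  # levels[l] is None or a sorted list of size 2^l

    def insert(self, key):
        carry = [key]
        l = 0
        while True:
            if l == len(self.levels):
                self.levels.append(None)
            if self.levels[l] is None:
                self.levels[l] = carry
                return
            # Merge two size-2^l sorted arrays into one of size 2^(l+1).
            carry = list(merge(self.levels[l], carry))
            self.levels[l] = None
            l += 1

    def lookup(self, key):
        # Each array is searched independently; an index, Bloom filter or
        # fractional cascading would narrow this in practice.
        return any(a is not None and key in a for a in self.levels)
```

Inserting 2, 9, 11, 8 reproduces the slide's sequence: the four keys end up merged into the single sorted array 2 8 9 11 at level 2. Each element is rewritten once per level, which is where the O((log N)/B) amortized insert cost comes from.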
• Version sets are disjoint at each level, so lookups examine one array per level. Merging arrays with intersecting version sets preserves this, but the result of a merge might not be dense. The answer: density amplification! Arrays are promoted, merged, density-amplified and demoted (in the diagram, {1,2} and {2,3} merge to {1,2,3}, which is then split, e.g. into {1,3} and {1}, while an array tagged {4} passes through unchanged).

• "Density amplification" example: the merged array has live(v0) = 2 and density 2/11, so it is not dense. Split 1, (A1, {v0, v5}), has live(v0) = 2 and live(v5) = 4, so its density is 2/4; split 2, (A2, {v4, v1, v2, v3}), has live(v4) = 2 and live(v1) = live(v2) = live(v3) = 3, so its density is 2/7.

• Version split lemma: there is a version split of (A, V), say (Ai, Vi) for i = 1...n, such that each array satisfies (L-dense) and (L-size) for level l, and there is at most one index i for which lead(Ai) < |Ai|/2. If (A, V) also satisfies (L-live) then every split of it does (since all live elements are included), and likewise for (L-edge). It follows that version splitting (A', V'), which necessarily has no promotable versions, results in a set of arrays all of which satisfy all of the L-* conditions necessary to stay at level l.

• Lemma 3 (Promotion): the fraction of lead elements over all output arrays after a version split is >= 1/39. Proof sketch: under the same conditions as the version split lemma, if in addition |A| < 2M and live(v) >= M/3 for all v, then the number of output strata is at most 13. Consider the arrays which obey the lead-fraction constraint: each has size at least M/3, since at least one version is live in it, and at least half of each such array is lead, so each has at least M/6 lead keys. The total number of lead keys in A is at most 2M, since the array itself is no larger than this; it follows that there can be no more than 12 such arrays, plus the single array allowed to violate the lead-fraction constraint.

• Snapshots and clones: on snapshot or clone of version v to a new descendant version v', v' is registered for each array A which is currently registered to the parent v. This does not require any IOs.

• Update bound. Theorem 1: the stratified doubling array performs updates to a leaf version v in a cache-oblivious O((log Nv)/B) amortized IOs. Proof idea: assume a memory buffer of size at least B (recall that B is not known to the algorithm); then each array involved in a disk merge has size at least B, so a merge of arrays totalling k elements costs O(k/B) IOs. In the COLA [5], each element exists in exactly one array and may participate in O(log N) merges, which immediately gives the desired amortized bound. Here, elements may exist in many arrays and may participate in many merges at the same level (e.g. when an array at level l is version split and some subarrays remain at level l), so the basic amortized method does not apply. Idea: charge the cost of merges and splits to lead elements only. Each (k, v) appears as lead in exactly one array, so there are always N lead elements in total; each lead element receives a charge of c/B on promotion, and the total charge for version v is O((log Nv)/B).
• Does it work?
• Insert rate, as a function of dictionary size (inserts per second vs. keys, in millions): the Stratified B-tree is roughly 3 orders of magnitude faster than the CoW B-tree.
• Range query rate, as a function of dictionary size (reads per second vs. keys, in millions): the Stratified B-tree is roughly 1 order of magnitude faster than the CoW B-tree.
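The CoW range-query ceiling quoted in the speaker notes (100 random IOs/s at 32 KB per IO, ~200 bytes per key) is simple to verify:

```python
# Back-of-envelope check of the CoW B-tree range-query ceiling from the
# speaker notes; all figures are the ones quoted there.
ios_per_sec = 100
bytes_per_io = 32 * 1024
bytes_per_key = 200

keys_per_sec = ios_per_sec * bytes_per_io // bytes_per_key  # 16384 keys/s
```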