Data Structures and Algorithms for Big Databases
Transcript

  • 1. Data Structures and Algorithms for Big Databases. Michael A. Bender (Stony Brook & Tokutek), Bradley C. Kuszmaul (MIT & Tokutek).
  • 2-3. The big data problem: data flows from ingestion, through indexing, to a query processor that turns queries into answers. For on-disk data, one sees funny tradeoffs in the speeds of data ingestion, query speed, and freshness of data.
  • 4-6. Don’t Thrash: How to Cache Your Hash in Flash. Funny tradeoff in ingestion, querying, freshness:
    ‣ “I’m trying to create indexes on a table with 308 million rows. It took ~20 minutes to load the table but 10 days to build indexes on it.” (MySQL bug #9544)
    ‣ “Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking forever.” (Comment on mysqlperformanceblog.com)
    ‣ “They indexed their tables, and indexed them well, / And lo, did the queries run quick! / But that wasn’t the last of their troubles, to tell / Their insertions, like molasses, ran thick.” (Not from Alice in Wonderland by Lewis Carroll)
  • 7. This tutorial: better data structures (Bɛ-trees, LSM trees, Fractal-tree® indexes) significantly mitigate the insert/query/freshness tradeoff. These structures scale to much larger sizes while efficiently using the memory hierarchy.
  • 8. What we mean by Big Data. We don’t define Big Data in terms of TB, PB, EB. By Big Data, we mean: the data is too big to fit in main memory; we need data structures on the data (words like “index” or “metadata” suggest that there are underlying data structures); and these data structures are also too big to fit in main memory.
  • 9. In this tutorial we study the underlying data structures for managing big data: file systems, SQL, NewSQL, NoSQL.
  • 10. But enough about databases... more about us.
  • 11. Tokutek. A few years ago we started working together on I/O-efficient and cache-oblivious data structures. Along the way, we started Tokutek to commercialize our research. (Michael, Martin, Bradley)
  • 12-15. Storage engines in MySQL. Tokutek sells TokuDB, an ACID-compliant, closed-source storage engine for MySQL, sitting between SQL processing/query optimization and the file system. TokuDB also has a Berkeley DB API and can be used independent of MySQL. Many of the data-structures ideas in this tutorial were used in developing TokuDB. But this tutorial is about data structures and algorithms, not TokuDB or any other platform.
  • 16. Our mindset: this tutorial is self-contained, and we want to teach. If something we say isn’t clear to you, please ask questions or ask us to clarify or repeat something. You should be comfortable using math, and you should want to listen to data structures for an afternoon.
  • 17. Topics and outline for this tutorial: the I/O model and cache-oblivious analysis; write-optimized data structures; how write-optimized data structures can help file systems; block-replacement algorithms; indexing strategies; log-structured merge trees; Bloom filters.
  • 18. Data Structures and Algorithms for Big Data. Module 1: I/O Model and Cache-Oblivious Analysis.
  • 19. Story for the module: if we want to understand the performance of data structures within databases, we need algorithmic models for modeling I/Os. There’s a long history of models for understanding the memory hierarchy. Many are beautiful; most have not found practical use. Two approaches are very powerful, and that’s what we’ll present here so we have a foundation for the rest of the tutorial.
  • 20. Modeling I/O using the Disk Access Model [Aggarwal+Vitter ’88]. How computation works: data is transferred in blocks between RAM and disk, and the number of block transfers dominates the running time. Goal: minimize the number of block transfers. Performance bounds are parameterized by block size B, memory size M, data size N.
  • 21-23. Example: scanning an array. Question: how many I/Os to scan an array of length N? Answer: O(N/B) I/Os; a scan touches at most N/B + 2 blocks.
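The N/B + 2 bound is easy to check directly. Below is a small sketch (the helper and the parameter values are hypothetical, chosen only to illustrate the claim) that counts the blocks a contiguous scan touches at any alignment:

```python
import math

def blocks_touched(start, n_items, B):
    """Number of size-B blocks a contiguous scan of n_items elements
    touches when the array begins at element index `start`."""
    first = start // B
    last = (start + n_items - 1) // B
    return last - first + 1

# A scan touches at most ceil(N/B) + 1 <= N/B + 2 blocks,
# so scanning costs O(N/B) I/Os regardless of alignment.
N, B = 1_000_000, 4096
worst = max(blocks_touched(s, N, B) for s in range(B))
print(worst, math.ceil(N / B) + 1)
```

The worst case occurs when the array straddles one extra block boundary at each end, which is where the "+2" in the bound comes from.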
  • 24-25. Example: searching in a B-tree. Question: how many I/Os for a point query or insert into a B-tree with N elements? Answer: O(log_B N).
  • 26-27. Example: searching in an array. Question: how many I/Os to perform a binary search into an array of size N (N/B blocks)? Answer: O(log₂(N/B)) ≈ O(log₂ N).
  • 28. Example: searching in an array versus a B-tree. Since O(log_B N) = O(log₂ N / log₂ B), the moral is that B-tree search is a factor of O(log₂ B) faster than binary search.
  • 29. Example: I/O-efficient sorting. Imagine the following sorting problem: 1000 MB of data, 10 MB of RAM, 1 MB disk blocks. Here’s a sorting algorithm: read in 10 MB at a time, sort it, and write it out, producing 100 10MB “runs”; merge 10 10MB runs together to make a 100MB run, repeating 10x; then merge 10 100MB runs together to make a 1000MB run.
  • 30-31. I/O-efficient sorting in a picture: 1000 MB unsorted data → sort 10MB runs → 10 MB sorted runs → merge 10MB runs into 100MB sorted runs → merge 100MB runs into the 1000 MB sorted output. Why merge in two steps? With 10 MB of RAM and 1 MB disk blocks, we can only hold 10 blocks in main memory.
  • 32. Merge sort in general. The example (produce 10MB runs; merge 10 10MB runs for 100MB; merge 10 100MB runs for 1000MB) becomes, in general: produce runs the size of main memory (size M); construct a merge tree with fanout M/B, with runs at the leaves; then repeatedly pick a node that hasn’t been merged and merge its M/B children together to produce a bigger run.
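The run-then-merge scheme above can be sketched in a few lines. This is a minimal in-memory simulation (the function name and parameters are illustrative, not from the slides); a real external sort would stream each run to and from disk, with M playing the role of memory size and fan_in the role of M/B:

```python
import heapq

def external_sort(items, M, fan_in):
    """Two-phase merge sort: sorted runs of size M, then repeated fan_in-way merges."""
    # Phase 1: produce sorted runs the size of main memory.
    runs = [sorted(items[i:i + M]) for i in range(0, len(items), M)]
    # Phase 2: repeatedly merge fan_in runs into one bigger run
    # until a single run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

# The 1000 MB / 10 MB / 1 MB example in miniature:
# 1000 items, runs of 10, fanout 10 -> exactly two merge passes.
data = [(i * 7919) % 1000 for i in range(1000)]
print(external_sort(data, 10, 10) == sorted(data))
```

Each pass of phase 2 reads and writes every element once, matching the "one scan per level of the merge tree" accounting in the analysis that follows.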
  • 33-34. Merge sort analysis. Question: how many I/Os to sort N elements? The first pass of runs takes N/B I/Os, and each level of the merge tree takes N/B I/Os; the merge tree is O(log_{M/B}(N/B)) levels deep. Total: O((N/B) log_{M/B}(N/B)), i.e., the cost to scan the data times the number of scans of the data. This bound is the best possible.
  • 35. Merge sort as divide-and-conquer. To sort an array of N objects: if N fits in main memory (size M), just sort the elements; otherwise, divide the array into M/B pieces, sort each piece (recursively), and merge the M/B pieces. This algorithm has the same I/O complexity.
  • 36. Analysis of divide-and-conquer. Recurrence relation: T(N) = N/B when N < M (the cost to sort something that fits in memory); otherwise T(N) = (M/B) · T(N/(M/B)) + N/B, i.e., the number of pieces times the cost to sort each piece recursively, plus the cost to merge. Solution: O((N/B) log_{M/B}(N/B)), the cost to scan the data times the number of scans of the data.
  • 37. Ignore CPU costs. The Disk Access Machine (DAM) model ignores CPU costs and assumes that all block accesses have the same cost. Is that a good performance model?
  • 38. The DAM model is a simplification: disks are organized into tracks of different sizes; fixed-size blocks are fetched; and tracks get prefetched into the disk cache, which holds ~100 tracks.
  • 39. The DAM model is a simplification: 2kB or 4kB is too small for the model (B-tree nodes in Berkeley DB & InnoDB have this size), and sequential block accesses run 10x faster than random block accesses, which doesn’t fit the model. There is no single best block size; the best node size for a B-tree depends on the operation (insert/delete/point query).
  • 40. Cache-oblivious analysis [Frigo, Leiserson, Prokop, Ramachandran ’99]: the parameters B and M are unknown to the algorithm or coder, yet performance bounds are still parameterized by block size B, memory size M, and data size N. Goal (as before): minimize the number of block transfers.
  • 41. Cache-oblivious algorithms work for all B and M... and for all levels of a multi-level memory hierarchy. It’s better to optimize approximately for all B and M than to pick the best B and M.
  • 42. B-trees and k-way merge sort aren’t cache-oblivious: a B-tree’s fan-out is a function of B, and merge sort’s fan-in is a function of M and B. Surprisingly, there are cache-oblivious B-trees and cache-oblivious sorting algorithms. [Frigo, Leiserson, Prokop, Ramachandran ’99] [Bender, Demaine, Farach-Colton ’00] [Bender, Duan, Iacono, Wu ’02] [Brodal, Fagerberg, Jacob ’02] [Brodal, Fagerberg, Vinther ’04]
  • 43. Time for 1000 random searches [Bender, Farach-Colton, Kuszmaul ’06]. There’s no best block size, and the optimal block size for inserts is very different:

        B       Small      Big
        4K      17.3ms     22.4ms
        16K     13.9ms     22.1ms
        32K     11.9ms     17.4ms
        64K     12.9ms     17.6ms
        128K    13.2ms     16.5ms
        256K    18.5ms     14.4ms
        512K    16.7ms

        CO B-tree: Small 12.3ms, Big 13.8ms
  • 44. Summary. Algorithmic models of the memory hierarchy explain how DB data structures scale. There’s a long history of models of the memory hierarchy; many are beautiful, but most haven’t seen practical use. DAM and cache-oblivious analysis are powerful: bounds are parameterized by block size B and memory size M, and in the CO model, B and M are unknown to the coder.
  • 45. Data Structures and Algorithms for Big Data. Module 2: Write-Optimized Data Structures.
  • 51. This module: write-optimized structures (Bɛ-trees, LSM trees, Fractal-tree® indexes) significantly mitigate the insert/query/freshness tradeoff. One can insert 10x-100x faster than B-trees while achieving similar point-query performance.
  • 52. An algorithmic performance model [Aggarwal+Vitter ’88]. How computation works: data is transferred in blocks between RAM and disk, and the number of block transfers dominates the running time. Goal: minimize the number of block transfers. Performance bounds are parameterized by block size B, memory size M, data size N.
  • 53. An algorithmic performance model. B-tree point queries: O(log_B N) I/Os. Binary search in an array: O(log₂(N/B)) ≈ O(log₂ N) I/Os, slower by a factor of O(log₂ B).
  • 54. Write-optimized data structures’ performance. Insert/delete: a B-tree costs O(log_B N) = O(log N / log B), while some write-optimized structures cost O((log N)/B). If B=1024, the insert speedup is B/log B ≈ 100. Hardware trends mean bigger B and bigger speedup, at less than 1 I/O per insert. Data structures: [O’Neil, Cheng, Gawlick, O’Neil 96], [Buchsbaum, Goldwasser, Venkatasubramanian, Westbrook 00], [Arge 03], [Graefe 03], [Brodal, Fagerberg 03], [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson ’07], [Brodal, Demaine, Fineman, Iacono, Langerman, Munro 10], [Spillane, Shetty, Zadok, Archak, Dixit 11]. Systems: BigTable, Cassandra, HBase, LevelDB, TokuDB.
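The "B/log B ≈ 100" arithmetic can be checked directly. A minimal sketch (the function names are illustrative; the costs drop constants and use the bounds quoted on the slide):

```python
import math

# Per-insert I/O costs from the slide, constants dropped:
#   B-tree:                  log(N) / log(B)   I/Os
#   write-optimized (eps=0): log(N) / B        I/Os
# Their ratio is B / log2(B), independent of N.
def btree_insert_cost(N, B):
    return math.log2(N) / math.log2(B)

def wods_insert_cost(N, B):
    return math.log2(N) / B

N, B = 10**9, 1024
speedup = btree_insert_cost(N, B) / wods_insert_cost(N, B)
print(round(speedup, 1))  # 1024 / 10 = 102.4
```

With B = 1024 the log N terms cancel and the ratio is exactly B / log₂ B = 102.4, matching the slide's "≈ 100".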
  • 55. Optimal search-insert tradeoff [Brodal, Fagerberg 03], as a function of ɛ ∈ [0,1]: inserts cost O(log_{1+B^ɛ} N / B^(1-ɛ)) and point queries cost O(log_{1+B^ɛ} N). At ɛ=1 (a B-tree): insert O(log_B N), point query O(log_B N). At ɛ=1/2: insert O(log_B N / √B), point query O(log_B N), giving 10x-100x faster inserts. At ɛ=0: insert O((log N)/B), point query O(log N).
  • 56-58. Illustration of the optimal tradeoff [Brodal, Fagerberg 03]: on a plot of insert speed versus point-query speed, logging is fast for inserts but slow for queries, the B-tree is the opposite, and the optimal curve connects them. Target of opportunity: insertions improve by 10x-100x with almost no loss of point-query performance.
  • 59. One way to build write-optimized structures (other approaches later).
  • 60-65. A simple write-optimized structure with O(log N) queries and O((log N)/B) inserts: a balanced binary tree with buffers of size B. Inserts + deletes: send insert/delete messages down from the root and store them in buffers; when a buffer fills up, flush it.
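The buffered-tree mechanics can be sketched concretely. This is a toy in-memory model (the class and function names, the tiny B, and the key-range tree shape are all assumptions for illustration; a real structure stores nodes in blocks on disk and rebalances):

```python
B = 4  # buffer size; stands in for the block size

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # key range [lo, hi) this node covers
        self.buffer = []            # pending ('ins'/'del', key, value) messages
        self.left = self.right = None
        self.items = {}             # leaf-resident key -> value pairs

def build(lo, hi, leaf_span=8):
    node = Node(lo, hi)
    if hi - lo > leaf_span:
        mid = (lo + hi) // 2
        node.left = build(lo, mid, leaf_span)
        node.right = build(mid, hi, leaf_span)
    return node

def send(node, msg):
    """Deposit a message in this node's buffer; flush when it fills."""
    node.buffer.append(msg)
    if len(node.buffer) >= B:
        flush(node)

def flush(node):
    msgs, node.buffer = node.buffer, []
    for op, key, value in msgs:
        if node.left is None:                       # leaf: apply the message
            if op == 'ins':
                node.items[key] = value
            else:
                node.items.pop(key, None)
        else:                                       # internal: push down one level
            send(node.left if key < node.left.hi else node.right,
                 (op, key, value))

def query(root, key):
    """Examine buffers along the root-to-leaf path; newer messages live higher."""
    node = root
    while node is not None:
        for op, k, v in reversed(node.buffer):
            if k == key:
                return v if op == 'ins' else None
        if node.left is None:
            return node.items.get(key)
        node = node.left if key < node.left.hi else node.right
```

A flush moves up to B messages down one level with O(1) block I/Os, which is exactly where the amortized O((log N)/B) insert cost in the next slide's analysis comes from.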
  • 66. Analysis of writes: an insert/delete costs amortized O((log N)/B). A buffer flush costs O(1) I/Os and sends B elements down one level, so it costs O(1/B) to send an element down one level of the tree, and there are O(log N) levels.
  • 67-68. Difficulty of key accesses.
  • 69. Analysis of point queries: to search, examine each buffer along a single root-to-leaf path. This costs O(log N).
  • 70. Obtaining optimal point queries + very fast inserts: use fanout √B, with buffers of size B. Point queries cost O(log_√B N) = O(log_B N), the tree height. Inserts cost O((log_B N)/√B), since each flush costs O(1) I/Os and flushes √B elements.
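The two bounds on this slide follow from a short calculation; writing it out (a sketch of the standard argument, using the slide's own quantities):

```latex
% Height with fanout \sqrt{B}: halving the log base's exponent only
% doubles the height, so the query cost matches a B-tree's.
\log_{\sqrt{B}} N \;=\; \frac{\log N}{\tfrac{1}{2}\log B} \;=\; 2\log_B N \;=\; O(\log_B N)

% Amortized insert cost: each message descends O(\log_B N) levels, and a
% flush moves \sqrt{B} messages down one level at O(1) I/O cost, i.e.
% O(1/\sqrt{B}) per message per level:
O(\log_B N)\cdot O\!\left(\frac{1}{\sqrt{B}}\right)
  \;=\; O\!\left(\frac{\log_B N}{\sqrt{B}}\right)
```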
  • 71-72. Cache-oblivious write-optimized structures. You can even make these data structures cache-oblivious [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson, SPAA 07] [Brodal, Demaine, Fineman, Iacono, Langerman, Munro, SODA 10]. This means the data structure can be made platform independent (no knobs), i.e., it works simultaneously for all values of B and M. You can be cache- and I/O-efficient with no knobs or other memory-hierarchy parameterization.
  • 73. What the world looks like: insert/point-query asymmetry. Inserts can be fast: >50K high-entropy writes/sec/disk. Point queries are necessarily slow: <200 high-entropy reads/sec/disk. We are used to reads and writes having about the same cost, but writing is easier than reading. Reading is hard; writing is easier.
  • 74-75. The right read-optimization is write-optimization. The right index makes queries run fast, and write-optimized structures maintain indexes efficiently. Fast writing is a currency we use to accelerate queries: better indexing means faster queries.
  • 76-77. The right read-optimization is write-optimization: with the I/O load freed up, we can add selective indexes (we can now afford to maintain them). Write-optimized structures can significantly mitigate the insert/query/freshness tradeoff.
  • 78. Implementation Issues
  • 79. Write optimization ✔. What’s missing? The optimal read-write tradeoff is easy; full-featured is hard: variable-sized rows, concurrency-control mechanisms, multithreading, transactions, logging, ACID-compliant crash recovery, optimizations for the special cases of sequential inserts and bulk loads, compression, backup.
  • 80. Systems often assume search cost = insert cost. Some inserts/deletes have hidden searches. Examples: returning an error when a duplicate key is inserted; returning the number of elements removed on a delete. These “cryptosearches” throttle insertions down to the performance of B-trees.
  • 81. Cryptosearches in uniqueness checking. Uniqueness checking has a hidden search:
        If Search(key) == True: Return Error
        Else: Fast_Insert(key, value)
    In a B-tree, uniqueness checking comes for free: on insert, you fetch a leaf anyway, so checking whether the key exists is no biggie.
  • 82. In a write-optimized structure, that cryptosearch can throttle performance: insertion messages are injected and only eventually get to the “bottom” of the structure, so insertion with uniqueness checking is ~100x slower. Bloom filters, Cascade Filters, etc. help. [Bender, Farach-Colton, Johnson, Kraner, Kuszmaul, Medjedovic, Montes, Shetty, Spillane, Zadok 12]
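One way a filter softens the cryptosearch is by answering "definitely absent" cheaply, so the expensive search runs only on a "maybe". A minimal sketch (the Bloom filter parameters and the helper names are hypothetical, not TokuDB's; real systems use carefully sized filters):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k salted hashes into an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h // 8] |= 1 << (h % 8)

    def might_contain(self, key):
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(key))

def insert_unique(bloom, slow_search, fast_insert, key, value):
    # Pay for the real (slow) search only when the filter says "maybe
    # present"; a "definitely absent" answer skips the cryptosearch
    # and goes straight to the fast, blind insert.
    if bloom.might_contain(key) and slow_search(key):
        return False          # duplicate: report the error
    bloom.add(key)
    fast_insert(key, value)
    return True
```

For mostly-new keys, almost every insert takes the fast path, restoring the write-optimized insert rate.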
  • 83. A locking scheme with cryptosearches. A simple implementation of pessimistic locking maintains locks in the leaves: insert row t (writer lock), search for row u (reader lock), search for row v and put a cursor there, then increment the cursor so it points to row w (reader range lock). This scheme is inefficient for write-optimized structures because there are cryptosearches on writes.
  • 84. Performance
  • 85. iiBench insertion benchmark: insertion of 1 billion rows into a table while maintaining three multicolumn secondary indexes. At the end of the test, TokuDB’s insertion rate remained at 17,028 inserts/second while InnoDB’s had dropped to 1,050 inserts/second, a difference of over 16x. (Ubuntu 10.10; 2x Xeon X5460; 16GB RAM; 8x 146GB 10k SAS in RAID10.)
  • 86. Compression.
  • 87. iiBench on SSD: TokuDB on rotating disk beats InnoDB on SSD. (Plot of insertion rate, 0-35000/sec, versus cumulative insertions up to 1.5e8, for InnoDB and TokuDB on RAID10, X25-E, and FusionIO.)
  • 88. Write-optimization can help schema changes. With TokuDB, the downtime for a schema change is seconds to minutes; we detailed an experiment that showed this in a blog post. TokuDB also introduced Hot Indexing: you can add an index to an existing table with total downtime of seconds to a few minutes, because when the index is finished MySQL closes and reopens the table, so the downtime occurs not when the command is issued, but later on. Still, it is quite minimal. (CentOS 5.5; 2x Xeon E5310; 4GB RAM; 4x 1TB 7.2k SATA in RAID0.)
  • 89. MongoDB with a Fractal-Tree index.
  • 90. Scaling into the Future
  • 91-93. Write-optimization going forward. Example: time to fill a disk in 1973, 2010, and 2022, logging high-entropy data sequentially versus indexing it in a B-tree versus a Fractal tree (row size 1K). Better data structures may be a luxury now, but they will be essential by the decade’s end.

        Year   Size    Bandwidth  Access time  Log data   Fill via B-tree  Fill via Fractal tree*
        1973   35MB    835KB/s    25ms         39s        975s             200s
        2010   3TB     150MB/s    10ms         5.5h       347d             36h
        2022   220TB   1.05GB/s   10ms         2.4d       70y              23.3d

        * Projected times for a fully multi-threaded version.
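The log-versus-B-tree columns are back-of-envelope arithmetic, and it's worth seeing the model explicitly. A sketch (the function is illustrative; it assumes sequential logging runs at full bandwidth and that each 1K-row B-tree insert costs roughly one random access):

```python
def fill_times(capacity_bytes, bandwidth_bytes_per_s, access_time_s,
               row_bytes=1000):
    """Time to fill a disk by sequential logging vs. by B-tree inserts,
    modeling each row insert as one random leaf I/O."""
    log_time = capacity_bytes / bandwidth_bytes_per_s
    btree_time = (capacity_bytes / row_bytes) * access_time_s
    return log_time, btree_time

# The 2010 row of the table: 3 TB at 150 MB/s with 10 ms access time.
log_s, btree_s = fill_times(3e12, 150e6, 10e-3)
print(log_s / 3600, btree_s / 86400)  # ~5.6 hours vs ~347 days
```

The same call with the 2022 row (220e12, 1.05e9, 10e-3) reproduces the ~2.4-day and ~70-year figures; the gap grows because capacity scales much faster than seek time shrinks.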
  • 94. Summary of module. Write-optimization can solve many problems: there is a provable point-query/insert tradeoff, and we can insert 10x-100x faster without hurting point queries; we can avoid much of the funny tradeoff between data ingestion, freshness, and query speed; and we can avoid tuning knobs.
  • 95. Data Structures and Algorithms for Big Data. Module 3 (Case Study): TokuFS. How to Make a Write-Optimized File System.
  • 96. Story for the module: algorithms for Big Data apply to all storage systems, not just databases. Some big-data users store their data in a file system. The problem with Big Data is microdata...
  • 97. HEC FSIO grand challenges: store 1 trillion files; create tens of thousands of files per second; traverse directory hierarchies fast (ls -R). B-trees would require at least hundreds of disk drives.
  • 98-99. TokuFS [Esmet, Bender, Farach-Colton, Kuszmaul HotStorage12]: a file-system prototype built on TokuDB over XFS, achieving >20K file creates/sec and very fast ls -R, meeting the HEC grand challenges on a cheap disk (except 1 trillion files). TokuFS offers orders-of-magnitude speedup on microdata workloads: it aggregates microwrites while indexing, so it can be faster than the underlying file system.
  • 100. Big speedups on microwrites. We ran microdata-intensive benchmarks comparing TokuFS to ext4, XFS, Btrfs, and ZFS, stressing metadata and file data, on commodity hardware: a 2-core AMD with 4GB RAM and a single 7200 RPM disk, a simple, cheap setup with no hardware tricks. In all tests, we observed orders-of-magnitude speedups.
  • 101-102. Faster on small-file creation: create 2 million 200-byte files in a shallow tree (results plotted on a log scale).
  • 103. Faster on metadata scan: recursively scan the directory tree for metadata, using the same 2 million files created before, starting on a cold cache to measure disk I/O efficiency.
  • 104. Faster on big directories: create one million empty files in a directory, with random names, then read them back. This tests how well a single directory scales.
  • 105. Faster on microwrites in a big file: randomly write out a file in small, unaligned pieces.
  • 106. TokuFS implementation.
  • 107. TokuFS employs two indexes. The metadata index maps pathname to file metadata: /home/esmet ⟹ mode, file size, access times, ...; /home/esmet/tokufs.c ⟹ mode, file size, access times, ... The data index maps (pathname, blocknum) to bytes: (/home/esmet/tokufs.c, 0) ⟹ [block of bytes]; (/home/esmet/tokufs.c, 1) ⟹ [block of bytes]. Block size is a compile-time constant, 512: good performance on small files, moderate on large files.
  • 108. Common queries exhibit locality. Metadata index keys: the full path as a string, so all the children of a directory are contiguous in the index; reading a directory is simple and fast. Data index keys: (full path, blocknum), so all the blocks for a file are contiguous in the index; reading a file is simple and fast.
  • 109. TokuFS compresses indexes. Reduces overhead from full-path keys: pathnames are highly “prefix redundant” and compress very, very well in practice. Reduces overhead from zero-valued padding: uninitialized bytes in a block are set to zero, and good portions of the metadata struct are set to zero. Compression is between 7-15x on real data (for example, a full MySQL source tree).
  • 110. TokuFS is fully functional. TokuFS is a prototype, but fully functional: it implements files, directories, metadata, etc., and interfaces with applications via a shared library and header. We wrote a FUSE implementation, too. FUSE lets you implement file systems in user space, but there’s overhead, so performance isn’t optimal. The best way to run is through our POSIX-like file API.
  • 111. Microdata is the Problem
  • 112. Data Structures and Algorithms for Big Data — Module 4: Paging. Michael A. Bender (Stony Brook & Tokutek), Bradley C. Kuszmaul (MIT & Tokutek)
  • 113. This Module: the algorithmics of cache-management. This will help us understand I/O- and cache-efficient algorithms.
  • 114. Recall the Disk Access Model [Aggarwal+Vitter ’88]. Data is transferred in blocks between RAM and disk; performance bounds are parameterized by B (block size), M (memory size), N (data size). Goal: minimize the number of block transfers. When a block is cached, the access cost is 0; otherwise it’s 1.
  • 115. Recall Cache-Oblivious Analysis [Frigo, Leiserson, Prokop, Ramachandran ’99]. Disk Access Model (DAM model): performance bounds are parameterized by B, M, N. Goal: minimize the number of block transfers. Beautiful restriction: the parameters B and M are unknown to the algorithm or coder.
  • 116. Recall Cache-Oblivious Analysis. CO analysis applies to unknown multilevel hierarchies: cache-oblivious algorithms work for all B and M... and for all levels of a multi-level hierarchy. Moral: it’s better to optimize approximately for all B and M than to try to pick the best B and M. [Frigo, Leiserson, Prokop, Ramachandran ’99]
  • 117. Cache-Replacement in Cache-Oblivious Algorithms. Which blocks are currently cached in RAM? The system performs its own caching/paging. If we knew B and M we could explicitly manage I/O. (But even then, what should we do?)
  • 118. (cont.) But systems may use different mechanisms, so what can we actually assume?
  • 119. This Module: Cache-Management Strategies. With cache-oblivious analysis, we can assume a memory system with optimal replacement. Even though the system manages memory, we can assume all the advantages of explicit memory management.
  • 120. This Module: Cache-Management Strategies. An LRU-based system with memory M performs cache-management < 2x worse than the optimal, prescient policy with memory M/2 [Sleator, Tarjan ’85]. Achieving optimal cache-management is hard because predicting the future is hard. But LRU with (1+ɛ)M memory is almost as good as (or better than) the optimal strategy with M memory.
  • 121. The paging/caching problem. A program is just a sequence of block requests r1, r2, r3, .... Algorithmic question: which block should be ejected when block rj is brought into cache? The cost of request rj is: cost(rj) = 0 if block rj is already cached, and cost(rj) = 1 if block rj must be brought into cache.
  • 122. The paging/caching problem. RAM holds only k = M/B blocks. Which block should be ejected when block rj is brought into cache?
  • 123. Paging Algorithms. LRU (least recently used): discard the block whose most recent access is earliest. FIFO (first in, first out): discard the block brought in longest ago. LFU (least frequently used): discard the least popular block. Random: discard a random block. LFD (longest forward distance) = OPT [Belady ’69]: discard the block whose next access is farthest in the future.
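LRU, in the exact cost model of these slides (cost 0 on a hit, cost 1 on a miss), can be sketched in a few lines. This is an illustrative sketch, not any system’s actual pager; `run_lru` is our own helper name.

```python
from collections import OrderedDict

def run_lru(requests, k):
    """Return the number of block transfers for LRU with k cache slots."""
    cache = OrderedDict()  # block -> None, ordered from least to most recent
    transfers = 0
    for r in requests:
        if r in cache:
            cache.move_to_end(r)           # hit: refresh recency, cost 0
        else:
            transfers += 1                 # miss: cost 1, bring r in
            if len(cache) >= k:
                cache.popitem(last=False)  # evict the least recently used
            cache[r] = None
    return transfers

# The adversarial stream from the next slides: with k=3 and 4 blocks
# cycling, LRU faults on every single request.
stream = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
assert run_lru(stream, 3) == len(stream)  # a transfer at every step
assert run_lru(stream, 4) == 4            # with k=4, only compulsory misses
```

Running the lower-bound stream through this sketch confirms the fragility discussed later: the bad example vanishes the moment the cache holds one more block.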
  • 124. Optimal Page Replacement. LFD (Longest Forward Distance) [Belady ’69]: discard the block requested farthest in the future. Cons: who knows the future?! (“Page 5348 shall be requested tomorrow at 2:00 pm.”) Pros: LFD can be viewed as a point of comparison with online strategies.
  • 127. Competitive Analysis. An online algorithm A is k-competitive if, for every request sequence R: cost_A(R) ≤ k · cost_OPT(R). Idea of competitive analysis: the optimal (prescient) algorithm is a yardstick we use to compare online algorithms.
  • 128. LRU is no better than k-competitive. Memory holds 3 blocks: M/B = k = 3. The program accesses 4 different blocks: rj ∈ {1, 2, 3, 4}. The request stream is 1, 2, 3, 4, 1, 2, 3, 4, ...
  • 129. LRU is no better than k-competitive. There’s a block transfer at every step, because LRU always ejects exactly the block that’s requested in the next step. (Table: request stream 1 2 3 4 1 2 3 4 1 2 vs. blocks in memory.)
  • 130. LRU is no better than k-competitive. On the same stream, LFD (longest forward distance) has a block transfer only every k = 3 steps. (Table: request stream vs. blocks in memory.)
  • 131. LRU is k-competitive. In fact, LRU is k = M/B-competitive [Sleator, Tarjan ’85], i.e., LRU has at most k = M/B times more transfers than OPT. A depressing result, because k is huge, so k · OPT is nothing to write home about. LFU and FIFO are also k-competitive. This is a depressing result because FIFO is empirically worse than LRU, and this isn’t captured in the math.
  • 132. On the other hand, the LRU bad example is fragile. If k = M/B = 4, not 3, then both LRU and OPT do well. If k = M/B = 2, not 3, then neither LRU nor OPT does well. (Table: request stream vs. blocks in memory.)
  • 133. LRU is 2-competitive with more memory. LRU is at most twice as bad as OPT when LRU has twice the memory: LRU_{|cache|=k}(R) ≤ 2 · OPT_{|cache|=k/2}(R) [Sleator, Tarjan ’85]. In general, LRU is nearly as good as OPT when LRU has a little more memory than OPT.
  • 134. (cont.) LRU has more memory, but OPT = LFD can see the future.
  • 136. LRU is 2-competitive with more memory: LRU_{|cache|=k}(R) ≤ 2 · OPT_{|cache|=k/2}(R) [Sleator, Tarjan ’85]. LRU has more memory, but OPT = LFD can see the future. (These bounds don’t apply to FIFO, distinguishing LRU from FIFO.)
  • 137. LRU Performance Proof. Divide the request sequence r1, r2, ..., ri, ri+1, ..., rj, rj+1, ..., rℓ, rℓ+1, ... into phases, each with k LRU faults.
  • 138. (cont.) OPT[k] must have ≥ 1 fault in each phase (case-analysis proof), so LRU is k-competitive.
  • 139. (cont.) OPT[k/2] must have ≥ k/2 faults in each phase. Main idea: each phase must touch k different pages. So LRU is 2-competitive.
  • 140. Under the hood of cache-oblivious analysis. Moral: with cache-oblivious analysis, we can analyze based on a memory system with optimal, omniscient replacement. Technically, an optimal cache-oblivious algorithm is asymptotically optimal versus any algorithm on a memory system that is slightly smaller. Empirically, this is just a technicality: LRU with (1+ɛ)M memory is almost as good as, or better than, OPT with M.
  • 141. Ramifications for New Cache-Replacement Policies. Moral: there’s not much performance on the table for new cache-replacement policies. Bad instances for LRU versus LFD are fragile and very sensitive to k = M/B. There are still research questions: What if blocks have different sizes [Irani 02][Young 02]? What if there’s a write-back cost? (Complexity unknown.) LRU may be too costly to implement (hence the clock algorithm).
  • 142. Data Structures and Algorithms for Big Data — Module 5: What to Index. Michael A. Bender (Stony Brook & Tokutek), Bradley C. Kuszmaul (MIT & Tokutek)
  • 143. Story of this module. This module explores indexing. Traditionally (with B-trees), indexing speeds queries but cripples inserts. But now we know that maintaining indexes is cheap. So what should you index?
  • 144. An Indexing Testimonial. This is a graph from a real user who added some selective indexes and reduced the I/O load on their server. (They couldn’t maintain the indexes with B-trees.) (Figure: I/O load over time, dropping at “Add selective indexes.”)
  • 145. What is an Index? To understand what to index, we need to get on the same page about what an index is.
  • 146. Row, Index, and Table. Row: a key,value pair (here key = a, value = (b,c)). Index: an ordering of rows by key (a dictionary), used to make queries fast. Table: a set of indexes. create table foo (a int, b int, c int, primary key(a)); (Sample rows (a,b,c): (100,5,45), (101,92,2), (156,56,45), (165,6,2), (198,202,56), (206,23,252), (256,56,2), (412,43,45).)
  • 147. An index is a dictionary. Dictionary API: maintain a set S subject to insert(x): S ← S ∪ {x}; delete(x): S ← S − {x}; search(x): is x ∊ S?; successor(x): return the min y > x s.t. y ∊ S; predecessor(x): return the max y < x s.t. y ∊ S. We assume these operations perform as well as a B-tree; for example, the successor operation usually doesn’t require an I/O.
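A toy, in-memory version of this dictionary API can be sketched with a sorted list and binary search. This is illustrative only (a real index would back the same interface with a B-tree or a write-optimized structure); the class and method names are ours.

```python
import bisect

class Dictionary:
    """Toy sorted-set with the slide's dictionary API."""
    def __init__(self):
        self.keys = []                       # kept sorted at all times

    def insert(self, x):
        i = bisect.bisect_left(self.keys, x)
        if i == len(self.keys) or self.keys[i] != x:
            self.keys.insert(i, x)

    def delete(self, x):
        i = bisect.bisect_left(self.keys, x)
        if i < len(self.keys) and self.keys[i] == x:
            self.keys.pop(i)

    def search(self, x):
        i = bisect.bisect_left(self.keys, x)
        return i < len(self.keys) and self.keys[i] == x

    def successor(self, x):                  # min y > x, or None
        i = bisect.bisect_right(self.keys, x)
        return self.keys[i] if i < len(self.keys) else None

    def predecessor(self, x):                # max y < x, or None
        i = bisect.bisect_left(self.keys, x)
        return self.keys[i - 1] if i > 0 else None

d = Dictionary()
for k in [100, 101, 156, 165, 198, 206, 256, 412]:  # keys from the sample table
    d.insert(k)
assert d.search(156) and not d.search(120)
assert d.successor(101) == 156
assert d.predecessor(100) is None
```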
  • 148. A table is a set of indexes, with operations: add index: add key(f1,f2,...); drop index: drop key(f1,f2,...); add column: adds a field to the primary-key value; remove column: removes a field and drops all indexes where the field is part of the key; change field type; ... all subject to index-correctness constraints. We want table operations to be fast too.
  • 149. Next: how to use indexes to improve queries.
  • 150. Indexes provide query performance. 1. Indexes can reduce the amount of retrieved data: less bandwidth, less processing, ... 2. Indexes can improve locality: not all data access costs the same; sequential access is MUCH faster than random access. 3. Indexes can presort data: GROUP BY and ORDER BY queries do post-retrieval work, and indexing can help get rid of this work.
  • 152. An index can select needed rows: count (*) where a<120;
  • 153. A scan of the primary index for a < 120 touches only rows (100,5,45) and (101,92,2), so the answer is 2.
  • 154. No good index means slow table scans: count (*) where b>50 and b<100;
  • 155. With only the primary index on a, every row must be scanned; the three matching rows (101,92,2), (156,56,45), (256,56,2) give the answer 3.
  • 156. You can add an index: alter table foo add key(b); (The new index orders (b,a) pairs: (5,100), (6,165), (23,206), (43,412), (56,156), (56,256), (92,101), (202,198).)
  • 157. A selective index speeds up queries: count (*) where b>50 and b<100;
  • 158. The matching entries (56,156), (56,256), (92,101) are contiguous in the b-index, so the answer, 3, comes from a short index scan.
  • 159. Selective indexes can still be slow: sum(c) where b>50;
  • 160. Selecting on b is fast: the matching entries (56,156), (56,256), (92,101), (202,198) are contiguous in the b-index.
  • 161. Fetching the info for summing c is slow: each matching index entry forces a point query into the primary index (a = 156, then 256, then 101, then 198).
  • 166. These point queries have poor data locality, so they dominate the cost. The retrieved c values are 45, 2, 2, 56, giving sum(c) = 105.
  • 167. Indexes provide query performance (recap). 1. Indexes can reduce the amount of retrieved data. 2. Indexes can improve locality: sequential access is MUCH faster than random access. 3. Indexes can presort data.
  • 168. Covering indexes speed up queries: alter table foo add key(b,c); sum(c) where b>50; (The index orders (b,c) pairs with the primary key a: (5,45)→100, (6,2)→165, (23,252)→206, (43,45)→412, (56,2)→256, (56,45)→156, (92,2)→101, (202,56)→198.)
  • 169. Because the (b,c) index contains both the selection key b and the needed column c, the query is answered by one contiguous index scan, with no point queries into the primary index: entries (56,2), (56,45), (92,2), (202,56) give sum(c) = 105.
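The covering-index effect is easy to reproduce with SQLite (Python’s stdlib sqlite3) on the slide’s exact table. Note the slides use MySQL-style `alter table ... add key` syntax; the equivalent in SQLite is `create index`. SQLite’s query plan typically reports a COVERING INDEX scan for this query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table foo (a int primary key, b int, c int)")
rows = [(100, 5, 45), (101, 92, 2), (156, 56, 45), (165, 6, 2),
        (198, 202, 56), (206, 23, 252), (256, 56, 2), (412, 43, 45)]
con.executemany("insert into foo values (?, ?, ?)", rows)

# The covering index from the slide: key(b, c).
con.execute("create index key_bc on foo (b, c)")

(total,) = con.execute("select sum(c) from foo where b > 50").fetchone()
print(total)  # 105, matching the slide

# Inspect the plan: the query is answered from the index alone,
# with no lookups back into the primary key.
for row in con.execute("explain query plan select sum(c) from foo where b > 50"):
    print(row)
```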
  • 170. Indexes provide query performance (recap). 1. Indexes can reduce the amount of retrieved data. 2. Indexes can improve locality. 3. Indexes can presort data: GROUP BY and ORDER BY queries do post-retrieval work, and indexing can help get rid of this work.
  • 171. Indexes can avoid post-selection sorts: select b, sum(c) group by b; With key(b,c), rows arrive presorted by b, so the groups come out in order with no post-retrieval sort: b = 5→45, 6→2, 23→252, 43→45, 56→47, 92→2, 202→56.
  • 172. Data Structures and Algorithms for Big Data — Module 6: Log Structured Merge Trees. Michael A. Bender (Stony Brook & Tokutek), Bradley C. Kuszmaul (MIT & Tokutek)
  • 173. Log Structured Merge Trees. Log-structured merge trees are write-optimized data structures developed in the ’90s [O’Neil, Cheng, Gawlick, O’Neil 96]. Over the past 5 years, LSM trees have become popular (for good reason). Accumulo, Bigtable, bLSM, Cassandra, HBase, Hypertable, and LevelDB are LSM trees (or borrow ideas). http://nosql-database.org lists 122 NoSQL databases; many of them are LSM trees.
  • 174. Recall the Optimal Search-Insert Tradeoff [Brodal, Fagerberg 03] (function of ɛ = 0...1): insert O((log_{1+B^ɛ} N) / B^{1−ɛ}), point query O(log_{1+B^ɛ} N). LSM trees don’t lie on this optimal search-insert tradeoff curve, but they’re not far off. We’ll show how to move them back onto the optimal curve.
  • 175. Log Structured Merge Tree [O’Neil, Cheng, Gawlick, O’Neil 96]. An LSM tree is a cascade of B-trees T0, T1, T2, T3, T4, .... Each tree Tj has a target size |Tj|; the target sizes are exponentially increasing. Typically, target size |Tj+1| = 10 |Tj|.
  • 176. LSM Tree Operations. Point queries: search each of T0, T1, T2, T3, T4 in turn.
  • 177. LSM Tree Operations. Range queries: must also examine every tree.
  • 178. LSM Tree Operations. Insertions: always insert the element into the smallest B-tree T0. When a B-tree Tj fills up, flush it into Tj+1.
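The insert-and-flush cascade can be sketched with sorted lists standing in for the static B-trees (the class name, growth factor of 10, and T0 capacity are illustrative choices; real LSM trees flush with bulk linear scans, which is where the cheap amortized insert cost analyzed later comes from).

```python
import bisect

GROWTH = 10      # target size |T_{j+1}| = 10 |T_j|, as on the slide
T0_CAPACITY = 4  # arbitrary small capacity for the toy T0

class ToyLSM:
    def __init__(self):
        self.levels = [[]]   # levels[j] is sorted; stands in for B-tree T_j

    def insert(self, key):
        bisect.insort(self.levels[0], key)  # always insert into T0
        j = 0
        while len(self.levels[j]) > T0_CAPACITY * GROWTH ** j:
            if j + 1 == len(self.levels):
                self.levels.append([])
            # Flush T_j into T_{j+1}: merge and rebuild from scratch.
            self.levels[j + 1] = sorted(self.levels[j + 1] + self.levels[j])
            self.levels[j] = []
            j += 1

    def search(self, key):
        # Point query: look in every level, smallest first.
        for lvl in self.levels:
            i = bisect.bisect_left(lvl, key)
            if i < len(lvl) and lvl[i] == key:
                return True
        return False

t = ToyLSM()
for x in range(100):
    t.insert(x)
assert t.search(42) and not t.search(1000)
assert all(lvl == sorted(lvl) for lvl in t.levels)  # every level stays sorted
```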
  • 179. LSM Tree Operations. Deletes are like inserts: instead of deleting an element directly, insert a tombstone message. A tombstone knocks out a “real” element when it lands in the same tree.
  • 180. Static-to-Dynamic Transformation [Bentley, Saxe ’80]. An LSM tree is an example of a “static-to-dynamic” transformation: an LSM tree can be built out of static B-trees. When T3 flushes into T4, T4 is rebuilt from scratch.
  • 181. This Module: let’s analyze LSM trees.
  • 182. Recall: Searching in an Array Versus a B-tree. A B-tree search costs O(log_B N) = O(log2 N / log2 B) block transfers.
  • 183. (cont.) Binary search in an array costs O(log2 (N/B)) ≈ O(log2 N) block transfers, which is much worse.
  • 184. Analysis of point queries. Search cost across T0, T1, ..., T_{log N}: log_B N + log_B (N/2) + log_B (N/4) + ... + log_B B = (1/log B)(log N + (log N − 1) + (log N − 2) + (log N − 3) + ... + 1) = O(log N · log_B N).
  • 185. Insert Analysis. The cost to flush a tree Tj of size X is O(X/B), since flushing and rebuilding a tree is just a linear scan. So the cost per element to flush Tj is O(1/B). The number of times each element is moved is ≤ log N. Hence the insert cost is O((log N)/B) amortized memory transfers. (Tj has size X; a flush costs O(1/B) per element; Tj+1 has size Θ(X).)
  • 186. Samples from the LSM Tradeoff Curve (insert vs. point query, as a function of ɛ). General tradeoff: insert O((log_{1+B^ɛ} N)/B^{1−ɛ}), point query O((log_B N)(log_{1+B^ɛ} N)). Sizes double (ɛ=0): insert O((log N)/B), point query O((log_B N)(log N)). Sizes grow by B^{1/2} (ɛ=1/2): insert O((log_B N)/√B), point query O((log_B N)(log_B N)). Sizes grow by B (ɛ=1): insert O(log_B N), point query O((log_B N)(log_B N)).
  • 187. How to improve LSM-tree point queries? Looking in all those trees is expensive, but it can be improved by caching, Bloom filters, and fractional cascading.
  • 188. Caching in LSM trees. When the cache is warm, the small trees are cached.
  • 189. Bloom filters in LSM trees. Bloom filters can avoid point queries for elements that are not in a particular B-tree. We’ll see how Bloom filters work later.
  • 190. Fractional cascading reduces the cost in each tree. Instead of avoiding searches in trees, we can use a technique called fractional cascading to reduce the cost of searching each B-tree to O(1). Idea: we’re looking for a key, and we already know where it should have been in T3; try to use that information to search T4.
  • 191. Searching one tree helps in the next. Looking up c: in Ti we know it’s between b and e. (Showing only the bottom level of each B-tree: Ti = [b e v w]; Ti+1 = [a c d f | h i j k | m n p q | t u y z].)
  • 192. Forwarding pointers. If we add forwarding pointers to the first tree, we can jump straight to the right node in the second tree to find c.
  • 193. Remove redundant forwarding pointers. We need only one forwarding pointer for each block in the next tree; remove the redundant ones.
  • 194. Ghost pointers. We need a forwarding pointer for every block in the next tree, even if there are no corresponding pointers in this tree. Add ghosts (here h and m).
  • 195. LSM tree + forwarding pointers + ghosts = fast queries. With forwarding pointers and ghosts, LSM trees require only one I/O per tree, and point queries cost only O(log_R N). [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson 07]
  • 196. LSM tree + forwarding pointers + ghosts = COLA. This data structure no longer uses the internal nodes of the B-trees, and each of the trees can be implemented by an array. [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson 07]
  • 197. Data Structures and Algorithms for Big Data — Module 7: Bloom Filters. Michael A. Bender (Stony Brook & Tokutek), Bradley C. Kuszmaul (MIT & Tokutek)
  • 198. Approximate Set Membership Problem. We need a space-efficient in-memory data structure to represent a set S to which we can add elements. We want to answer membership queries approximately: if x is in S then we want query(x,S) to return true; otherwise we want query(x,S) to usually return false. Bloom filters are a simple data structure that solves this problem.
  • 199. How do approximate queries help? Recall that for LSM trees (without fractional cascading), we wanted to avoid looking in a tree if we knew a key wasn’t there. Bloom filters let us usually avoid the lookup. (Bloom filters don’t seem to help with range queries, however.)
  • 200. Simplified Bloom Filter. Use hashing, but instead of storing elements, use one bit to keep track of whether an element is in the set: an array A of m bits; a uniform hash function h: S → [0, m). To insert s: set A[h(s)] = 1. To check s: check whether A[h(s)] = 1.
  • 201. Example using a Simplified Bloom Filter. Use an array of length 6 (after the inserts, A = 0 0 0 1 0 1 at positions 0-5). Insert a, where h(a)=3, and b, where h(b)=5. Look up: a: h(a)=3, answer is yes; maybe a is there (and it is). b: h(b)=5, answer is yes; maybe b is there (and it is). c: h(c)=2, answer is no; c is definitely not there. d: h(d)=3, answer is yes; maybe d is there (nope: a false positive).
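The simplified one-hash filter above fits in a few lines of code. This is an illustrative sketch: the hash here is Python’s built-in hash reduced mod m (a stand-in for the slide’s h, so the exact slots differ), but the behavior is the same, "no" is always correct and "yes" means only "maybe".

```python
class SimplifiedBloom:
    """One-bit-per-slot, single-hash membership filter."""
    def __init__(self, m):
        self.m = m
        self.bits = [0] * m

    def _h(self, s):
        return hash(s) % self.m   # stand-in for a uniform hash h: S -> [0, m)

    def insert(self, s):
        self.bits[self._h(s)] = 1

    def query(self, s):
        # True means "maybe present"; False means "definitely absent".
        return self.bits[self._h(s)] == 1

f = SimplifiedBloom(6)
f.insert("a")
f.insert("b")
assert f.query("a") and f.query("b")   # inserted items always answer yes
# Any other element that happens to hash to an occupied slot is a
# false positive, exactly like d in the slide's example.
```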
  • 202. Analysis of the Simplified Bloom Filter. If n items are hashed into an array of size m, then the chance of getting a YES answer for an element that is not there is ≈ 1 − e^{−n/m}. If you fill the array about 30% full, you get about 50% odds of a false positive, and each object requires about 3 bits. How do you get the odds down to a 1% false positive?
  • 203. Smaller False Positives. One way would be to fill the array only 1% full: not space efficient. Another way would be to use 7 arrays, with 7 hash functions. The false-positive rate becomes 1/128, and the space is 21 bits per object.
  • 204. Bloom filter. Idea: don’t use 7 separate arrays; use one array that’s 7 times bigger, and set the 7 hashed bits in it. For a 1% false-positive rate, it takes about 10 bits per object.
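Here is a sketch of that one-array, k-hash Bloom filter. Deriving the 7 positions from slices of a single SHA-256 digest is our implementation convenience (the slides just assume k independent hash functions), and m = 10 bits per object with k = 7 follows the slide’s rule of thumb for roughly a 1% false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one flag per slot (bit-packing omitted)

    def _positions(self, s):
        # Carve k 4-byte chunks out of one SHA-256 digest to get
        # k pseudo-independent positions in [0, m).
        d = hashlib.sha256(s.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m

    def insert(self, s):
        for p in self._positions(s):
            self.bits[p] = 1

    def query(self, s):
        # "Maybe present" only if ALL k bits are set.
        return all(self.bits[p] for p in self._positions(s))

bf = BloomFilter(m=10 * 1000, k=7)   # ~10 bits per object for 1000 objects
for i in range(1000):
    bf.insert(f"key{i}")
assert all(bf.query(f"key{i}") for i in range(1000))  # never a false negative
fp = sum(bf.query(f"other{i}") for i in range(10000)) / 10000
print(f"observed false-positive rate ~ {fp:.3f}")     # close to 1%
```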
  • 205. Other Bloom Filters. Counting Bloom filters [Fan, Cao, Almeida, Broder 2000] allow deletions by maintaining a 4-bit counter instead of a single bit per object. Buffered Bloom filters [Canim, Mihaila, Bhattacharjee, Ross 2010] employ hash localization to direct all the hashes of a single insertion to the same block. Cascade filters [Bender, Farach-Colton, Johnson, Kraner, Kuszmaul, Medjedovic, Montes, Shetty, Spillane, Zadok 2011] support deletions, exhibit locality for queries, insert quickly, and are cache-oblivious.
  • 206. Closing Words10
  • 207. We want to feel your pain. We are interested in hearing about other scaling problems. Come talk to us: bender@cs.stonybrook.edu, bradley@mit.edu
  • 208. Big Data Epigrams. The problem with big data is microdata. Sometimes the right read optimization is a write optimization. As data becomes bigger, the asymptotics become more important. Life is too short for half-dry whiteboard markers and bad sushi.
