Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

VLDB2013 Session 1 Emerging Hardware


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

VLDB2013 Session 1 Emerging Hardware

  1. 1. 担当: 若森拓馬(NTT) 【VLDB2013勉強会】 Session 1: Emerging Hardware
  2. 2. Hybrid Storage Management for Database Systems 2 }  目的:SSDとHDDからなるハイブリッドなストレージシステムを データベース管理に用いる }  貢献 }  GD2L:ハイブリッドストレージにおけるcost-awareなbuffer pool管理 アルゴリズムの提案 }  GreedyDualアルゴリズム[1]をベースに,SSDとHDDのパフォーマンスの差異 を考慮 }  CAC:予測的なコストベースのSSD管理技術の提案 }  GD2LとCACの実験によるパフォーマンス評価 }  MySQLのInnoDBストレージエンジンに提案機構を実装 Session 1: Emerging Hardware 担当:若森拓馬(NTT) システムアーキテクチャ(Figure: 1より) ustrated in Figure 1. When writing data to storage, the MS chooses which type of device to write it to. RAM Buffer pool manager SSD manager HDD read/write requests Buffer pool SSD write Figure 1: System Architecture revious work has considered how a DBMS should place in a hybrid storage system [11, 1, 2, 16, 5]. We provide mmary of such work in Section 7. In this paper, we a broader view of the problem than is used by most his work. Our view includes the DBMS bu↵er pool as as the two types of storage devices. We consider two ted problems. The first is determining which data should etained in the DBMS bu↵er pool. The answer to this tion is a↵ected by the presence of hybrid storage because ks evicted from the bu↵er cache to an SSD are much er to retrieve later than blocks evicted to the HDD. Thus, onsider cost-aware bu↵er management, which can take distinction into account. Second, assuming that the is not large enough to hold the entire database, we e the problem of deciding which data should be placed on SSD. This should depend on the physical access pattern [1] N. Young. The k-server dual and loose competitiveness for paging. Algorithmica, 11:525–541, 1994. •  全てのDBのページはHDDに格納され,いくつかのページのコ ピーがSSDとbuffer poolのいずれかor両方に格納されている •  buffer poolがいっぱいになった時,buffer pool managerが ページ置換ポリシに基づいてページをbufferから取り除く •  SSD managerは, •  SSD追加ポリシに基づいて,buffer poolから取り除かれ たページをSSDに追記する •  SSD置換ポリシに基づいて,SSDからデータを取り除く
  3. 3. GD2L:cost-awareなバッファプール管理アルゴリズム 3 }  前提 }  ページpはコストHをもつ }  L(inflation value)を用いてHを計算 する }  2つの優先度付キューを使用する }  QS:SSD上のページ管理用 }  QD:SSD上にないページ管理用 }  アルゴリズム }  ページpがキャッシュされていない 時 }  QsのLRUページとQDのLRUページ のコストHを比較し,小さい方のHを もつページqを取り除く }  L←H(q),pをキャッシュに置く }  pがSSD上にあるとき }  H(p)←L + RS,pをQsのMRUに置く }  それ以外で,pがHDD上にあるとき }  H(p) = L + RD,pをQDのMRUに置く Session 1: Emerging Hardware 担当:若森拓馬(NTT) H S S S S H H H H HQD QS S QSLRU QDLRU Page is flushed to the SSD Page is evicted from the SSD Figure 3: Bu↵er pool managed by GD2L GD2Lによるbuffer pool管理(Figure: 3より) ストレージデバイスのパラメータ RD:HDDからの読み込み時間 WD:HDDへの書き込み時間 RS:SSDからの読み込み時間 WS:SSDへの書き込み時間 buffer poolの空きが足りなくなった時,page cleanerは dirty bufferをcost-awareのポリシに基づいて ストレージにフラッシュする
  4. 4. cost-awareキャッシングの効果 4 }  ページがSSDにキャッシュ されているとき,buffer poolミス率と物理書き込み 率が高くなる }  理由 }  ページが一度SSDに置か れると,そのページは読み 出しコストが低いため, buffer poolから取り除かれ やすくなる }  page cleanerは取り除く前 にdirty pageをフラッシュし なければならない Session 1: Emerging Hardware 担当:若森拓馬(NTT) gure 3: Bu↵er pool managed by GD2L types of writes: replacement writes and recoverabil- s. Replacement writes are issued when dirty pages ified as eviction candidates. To remove the latency d with synchronous writes, the page cleaners try to hat pages that are likely to be replaced are clean me of the replacement. In contrast, recoverabil- s are those that are used to limit failure recovery he InnoDB uses write ahead logging to ensure that ed database updates survive failures. The failure time depends on the age of the oldest changes in r pool. The page cleaners issue recoverability writes ast recently modified pages to ensure that a config- covery time threshold will not be exceeded. oDB, when the free space of the bu↵er pool is below Figure 4: Miss rate/write rate while on the HDD (only) vs. miss rate/write rate while on the SSD. Each point represents one page HDDのキャッシュミス/書き込み率 対 SSDのキャッシュミス/書き込み率 (Figure: 4より)
  5. 5. CAC:Cost-Adjusted Caching 5 }  pをSSDに置くかどうかを決定するために,利得Bを見積もる }  p’がすでにSSDキャッシュにあるとき,B(p’)<B(p)を満たすpをSSD キャッシュに置き,最も小さいBをもつページを取り除く }  miss rate expansion factor (α) }  ページの物理read/write率が, SSDにページがあるかどうかで どのように変化するかを見積もる ために導入 Session 1: Emerging Hardware 担当:若森拓馬(NTT) Figure 5: Summary of Notation Symbol Description rD, wD Measured physical read/write count while not on the SSD rS, wS Measured physical read/write count while on the SSD crD, cwD Estimated physical read/write count if never on the SSD crS, cwS Estimated physical read/write count if al- ways on the SSD mS Bu↵er cache miss rate for pages on the SSD mD Bu↵er cache miss rate for pages not on the SSD ↵ Miss rate expansion factor Figure 6: Miss rate expansion factor for pages from three TPC-C tables. expansion factors of pages grouped by table and by logical 3つのTPC-Cのテーブル(のページ)に対する miss rate expansion factor(Figure: 6より) would experience if it were placed on the SSD, and crD(p) and cwD(p) are the physical read and write counts p would experience if it were not. Using these hypothetical physical read and write counts, we can write our desired estimate of the benefit of placing p on the SSD as follows B(p) = (crD(p)RD crS(p)RS)+( cwD(p)WD cwS(p)WS) (2) Thus, the problem of estimating benefit reduces to the prob- lem of estimating values for crD(p), crS(p), cwD(p), and cwS(p). To estimate crS(p), CAC uses two measured read counts: rS(p) and rD(p). (In the following, we will drop the explicit page reference from our notation as long as the page is clear from context.) In general, p may spend some time on the SSD and some time not on the SSD. rS is the count of the number of physical reads experienced by p while p is on the SSD. rD is the number of physical reads experienced by p while it is not on the SSD. To estimate what p’s physical read count would be if it were on the SSD full time (crS), CAC uses a room for a new entry. The Miss Rate Expansion Factor urpose of the miss rate expansion factor (↵) is to how much a page physical read and write rates nge if the page is admitted to the SSD. A simple estimate ↵ is to compare the overall miss rates of the SSD to that of pages that are not on the SSD. that mS represents the overall miss rate of logical uests for pages that are on the SSD, i.e., the total of physical reads from the SSD divided by the total of logical reads of pages on the SSD. Similarly, let esent the overall miss rate of logical read requests s that are not located on the SSD. Both mS and easily measured. Using mS and mD, we can define rate expansion factor as: ↵ = mS mD (7) mple, ↵ = 3 means that the miss rate is three times ages for on the SSD than for pages that are not on where mS(g) is they are in the pages in g while We track logi vidual page, as counts are upda to the page. Gr tain events occu pages. Specifica is evicted from from the bu↵er from the SSD. B their logical rea which a page b occurs, we subt old group and a It is possible t groups. For ex Figure 5: Summary of Notation Symbol Description rD, wD Measured physical read/write count while not on the SSD rS, wS Measured physical read/write count while on the SSD crD, cwD Estimated physical read/write count if never on the SSD crS, cwS Estimated physical read/write count if al- ways on the SSD mS Bu↵er cache miss rate for pages on the SSD mD Bu↵er cache miss rate for pages not on the SSD ↵ Miss rate expansion factor scaling it up to account for any time in which the page was not on the SSD. While this might be e↵ective, it will work only if the page has actually spent time on the SSD to that rS can be observed. We still require a way to estimate rS for pages that have not been observed on the SSD. In contrast, estimation using Equation 3 will work even if rS or rD are zero due to lack of observations. We track reference counts for all pages in the bu↵er pool and all pages in the SSD. In addition, we maintain an out- queue for to track reference counts for a fixed number (Noutq) of additional pages. When a page is evicted from the SSD, Figure 6: Miss rate expansion fact three TPC-C tables. expansion factors of pages grouped by read rate. The three lines represent pag C STOCK, CUSTOMER, and ORDER Since di↵erent pages may have substa rate expansion factors, we use di↵erent di↵erent groups of pages. Specifically, pages based on the database object (e they store data, and on their logical rea a di↵erent expansion factor for each gr range of possible logical read rates into size. We define a group as pages tha same database object and whose logic the same subrange. For example, in o : SSD上に全く無いときの物理read/writeの見積もり数 : SSD上に常にあるときの物理read/writeの見積もる数 :SSD上のページに対する バッファキャッシュミス率 :SSD上にないページに対する バッファキャッシュミス率
  6. 6. GD2LとCACのパフォーマンス評価 6 Session 1: Emerging Hardware 担当:若森拓馬(NTT) Next, we consider the impact of switching from a non- anticipatory cost-based SSD manager (CC) to an antici- patory one (CAC). Figure 7 shows that GD2L+CAC pro- vides additional performance gains above and beyond those achieved by GD2L+CC in the case of the large (30GB) database. Together, GD2L and CAC provide a TPC-C per- formance improvement of about a factor of two relative to the LRU+CC baseline in our 30GB tests. The performance gain was less significant in the 15GB database tests and non-existent in the 8GB database tests. Figure 8 shows that GD2L+CAC results in lower total I/O costs on both the SSD and HDD devices, relative to GD2L+CC, in the 30GB experiments. Both policies re- sult in similar bu↵er pool hit ratios, so the lower I/O cost achieved by GD2L-CAC is attributable to better decisions about which pages to retain on the SSD. To better under- stand the reasons for the lower total I/O cost achieved by CAC, we analyzed logs of system activity to try to iden- tify specific situations in which GD2L+CC and GD2L+CAC make di↵erent placement decisions. One interesting situa- tion we encounter is one in which a very hot page that is in the bu↵er pool is placed in the SSD. This may occur, for ex- ample, when the page is cleaned by the bu↵er manager and there is free space in the SSD, either during cold start and because of invalidations. When this occurs, I/O activity for the hot page will spike because GD2L will consider the page to be a good eviction candidate. Under the CC policy, such a page will tend to remain in the SSD because CC prefers to keep pages with high I/O activity in the SSD. In contrast, CAC is much more likely to to evict such a page from the SSD, since it can (correctly) estimate that moving the page will result in a substantial drop in I/O activity. Thus, we find that GD2L+CAC tends to keep very hot pages in the bu↵er pool and out of the SSD, while with GD2L+CC such pages tend to remain in the SSD and bounce into and out of the bu↵er pool. Such dynamics illustrate why it is impor- tant to use an anticipatory SSD manager (like CAC) if the bu↵er pool manager is cost-aware. For the experiments with smaller databases (15GB and 8GB), there is little di↵erence in performance between GD2L+CC and GD2L+CAC. Both policies result in similar per-transaction I/O costs and similar TPC-C throughput. This is not sur- prising, since in these settings most or all of the hot part of the database can fit into the SSD, i.e., there is no need to be marginally faster than LRU+LRU2, and for the smallest database (8GB) they were essentially indistinguishable. LRU- +MV-FIFO performed much worse than the other two al- gorithms in all of the scenarios we test. Figure 10: TPC-C Throughput Under LRU+MV- FIFO, LRU+LRU2, GD2L+LRU2, and GD2L+CAC for Various Database and DBMS }  GD2L+CACが最も良いパ フォーマンスを達成 Alg & HDD HDD SSD SSD total BP miss BP size util I/O util I/O I/O rate (GB) (%) (ms) (%) (ms) (ms) (%) GD2L+CAC 1G 85 39.6 19 8.8 48.4 7.4 2G 83 31.8 20 7.8 39.6 6.3 4G 82 23.5 20 5.8 29.3 4.8 LRU+LRU2 1G 85 49.4 15 8.6 58.0 6.7 2G 87 50.3 12 6.7 57.0 4.4 4G 90 48.5 9 4.7 53.2 2.4 GD2L+LRU2 1G 73 41.1 43 24.6 65.8 10.7 2G 79 46.5 32 18.9 65.3 8.8 4G 79 34.4 32 13.8 48.2 7.6 LRU+FIFO 1G 91 101.3 8 9.2 110.5 6.6 2G 92 92.3 6 5.8 98.1 4.4 4G 92 62.9 5 3.7 66.6 2.5 Figure 11: Device Utilizations, Bu↵er Pool Miss Rate, and Normalized I/O Time (DB size=30GB) to 400M. We tested k set to 1%, 2%, 5% and size, Our results showed that k values in th impact on TPC-C throughput. In InnoDB, fier is eight bytes and the size of each page is hash map for a 400M SSD fits into ten pag the rate with which the SSD hash map was fl that even with k = 1%, the highest rate of ch hash map experienced by any of the three SS algorithms (CAC, CC, and LRU2) is less tha ond. Thus, the overhead imposed by checkp map is negligible. 7. RELATED WORK Placing hot data in fast storage (e.g. h cold data in slow storage (e.g. tapes) is n Hierarchical storage management (HSM) is technique which automatically moves data cost and low-cost storage media. It uses f TPC-Cスループット(Figure: 10より) デバイス利用率,buffer poolミス率,正規化後I/O時間(Figure: 11より)
  7. 7. 参考: GreedyDualアルゴリズム 7 }  前提:オブジェクトサイズが均等で,異なる読み出し コストを持つ }  ページpに非負のコストHを与える }  あるページがキャッシュに移動または参照されたとき, Hはページをキャッシュに読み出すコストに設定され る }  キャッシュに空きを作るとき,キャッシュ上で最も小さ なH(Hmin)をもつページが除かれる }  残りの全てのページのHからHminを引く Session 1: Emerging Hardware 担当:若森拓馬(NTT)