
CaSSanDra: An SSD Boosted Key-Value Store


This presentation was given by Prashanth Menon at ICDE '14 on April 3, 2014 in Chicago, IL, USA.

The full paper and additional information are available at:
http://msrg.org/papers/Menon2013

Abstract:
With the ever-growing size and complexity of enterprise systems, there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses.

With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential and random access is now becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.


Transcript

  • 1. CaSSanDra: An SSD Boosted Key-Value Store. Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen (Middleware Systems Research Group, MSRG.ORG)
  • 2. Outline • Application Performance Management • Cassandra and SSDs • Extending Cassandra's Row Cache • Implementing a Dynamic Schema Catalogue • Conclusions
  • 3. Modern Enterprise Architecture • Many different software systems • Complex interactions • Stateful systems often distributed/partitioned/replicated • Stateless systems certainly duplicated
  • 4. Application Performance Management • Lightweight agent attached to each software system instance • Monitors system health • Traces transactions • Determines root causes • Raw APM metric: (diagram: agents attached to every component of the enterprise architecture)
  • 5. Application Performance Management • Problem: agents have short memory and only have a local view • What was the average response time for requests served by servlet X between December 18-31, 2011? • What was the average time spent in each service/database to respond to client requests?
  • 6. APM Metrics Datastore • All agents store metric data in a high write-throughput datastore • Metric data is at a fine granularity (per-action, millisecond, etc.) • The user now has a global view of the metrics • What is the best database to store APM metrics?
  • 7. Cassandra Wins APM • APM experiments performed by Rabl et al. [1] show Cassandra performs best for the APM use case • In-memory workloads including 95%, 50%, and 5% reads • Workloads requiring disk access with 95%, 50%, and 5% reads • (figures from [1]: throughput and read/write latency for Workloads R and RW on 2-12 nodes, comparing Cassandra, HBase, Voldemort, VoltDB, Redis, and MySQL) • [1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf
  • 8. Cassandra • Built at Facebook by former Dynamo engineers • Open sourced to Apache in 2009 • DHT with consistent hashing • MD5 hash of the key • Multiple nodes handle segments of the ring for load balancing • Dynamo distribution and replication model + BigTable storage model • (diagram: commit log, Memtable, SSTables)
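
As a rough illustration of the consistent hashing on slide 8, the sketch below maps a row key to its owning node by taking the MD5 digest of the key as a ring token and walking clockwise to the next node token, in the spirit of Cassandra's RandomPartitioner. The node names and token placement are invented for this example and do not come from the presentation.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    // Minimal sketch of MD5-based consistent hashing (RandomPartitioner-style).
    // A key's token is its MD5 digest; the key is owned by the first node whose
    // token is >= the key's token, wrapping around the ring if necessary.
    public class RingSketch {
        private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token -> node

        void addNode(String node, BigInteger token) {
            ring.put(token, node);
        }

        // Token = MD5(key) interpreted as a non-negative 128-bit integer.
        static BigInteger token(String key) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        }

        String ownerOf(String key) throws Exception {
            Map.Entry<BigInteger, String> owner = ring.ceilingEntry(token(key));
            return (owner != null ? owner : ring.firstEntry()).getValue();
        }

        public static void main(String[] args) throws Exception {
            RingSketch ring = new RingSketch();
            ring.addNode("node-A", BigInteger.ONE.shiftLeft(126));
            ring.addNode("node-B", BigInteger.ONE.shiftLeft(127));
            ring.addNode("node-C", BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE));
            System.out.println(ring.ownerOf("HostA/AgentX/AVGResponse")); // prints the owning node
        }
    }
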
  • 9. Cassandra and SSDs • Improve performance either by adding nodes or by improving per-node performance • Node performance depends directly on the disk I/O performance of the system • Cassandra stores two entities on disk: the commit log and the SSTables • Should SSDs be used to store both? • We evaluated each possible configuration
  • 10. Experiment Setup • Server specification: 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel x520 SSD • Apache Cassandra 1.10 • Used the YCSB benchmark • 100M rows, 50GB total raw data, 'latest' distribution • 95% read, 5% write • Minimum three runs per workload, fresh data on each run • Broken into phases: data load, fragmentation, cache warm-up, workload (> 12h process)
  • 11. SSD vs. HDD • Location of the log is irrelevant • Location of the data is important • Dramatic performance improvement of SSD over HDD • SSD benefits from high parallelism • Configurations: C1 (1 client, 2 threads/client, data on RAID HDD, commit log on RAID HDD), C2 (1, 2, data on HDD, log on SSD), C3 (1, 2, data on SSD, log on HDD), C4 (1, 2, data on SSD, log on SSD), C5 (4 clients, 16 threads/client, data on HDD, log on HDD), C6 (4, 16, data on SSD, log on SSD) • (Fig. 4(a)-(b): throughput and latency for configurations C1-C6)
  • 12. SSD vs. HDD (II) • SSD offers more than a 7x throughput improvement on an empty disk • SSD performance degrades by half as the storage device fills up • Filling the SSD or running it near capacity is not advisable • (Fig. 4(c)-(d): throughput and latency for HDD vs. SSD with empty vs. 99% full disks)
  • 13. SSD vs. HDD: Summary • Cassandra benefits most when storing data on SSD (not the log) • Location of the commit log is not important • SSD performance is inversely proportional to the fill ratio • Storing all data on SSD is uneconomical • Replacing a 3TB HDD with 3x 1TB SSDs is 10x more costly • SSDs have a limited lifetime (10-50K write-erase cycles), so they must be replaced more frequently • Rabl et al. [1] show adding a node is 100% costlier for a 100% throughput improvement • Build a hybrid system to get comparable performance at marginal cost
  • 14. Cassandra: Read + Write Path • Write path is fast: 1. Write the update into the commit log 2. Write the update into the Memtable • Memtables flush to SSTables asynchronously when full • Never blocks writes • Read path can be slow: 1. Read the key-value from the Memtable 2. Read the key-value from each SSTable on disk 3. Construct a merged view of the row from each input source • Each read needs O(# of SSTables) I/Os • (diagram: read and update paths through the Memtable, commit log, and SSTables)
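
The read path on slide 14 can be condensed into the following sketch: consult the Memtable, consult every SSTable, and merge the column fragments with last-write-wins semantics on the cell timestamps. The Cell record and the Map-based stand-ins for the Memtable and SSTables are illustrative assumptions, not Cassandra's actual classes.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the merged read path: the newest version of each column may live
    // in the Memtable or in any SSTable, so a read consults all of them and keeps
    // the cell with the highest timestamp per column. Maps stand in for the real
    // on-disk structures; each SSTable lookup is one (possibly random) disk I/O.
    class ReadPathSketch {
        record Cell(long timestamp, String value) {}

        static Map<String, Cell> readRow(String key,
                                         Map<String, Map<String, Cell>> memtable,
                                         List<Map<String, Map<String, Cell>>> sstables) {
            Map<String, Cell> merged = new HashMap<>();
            merge(merged, memtable.get(key));                     // step 1: Memtable
            for (Map<String, Map<String, Cell>> sstable : sstables) {
                merge(merged, sstable.get(key));                  // step 2: every SSTable
            }
            return merged;                                        // step 3: merged view of the row
        }

        private static void merge(Map<String, Cell> merged, Map<String, Cell> fragment) {
            if (fragment == null) return;
            fragment.forEach((column, cell) ->
                merged.merge(column, cell,
                    (oldCell, newCell) -> newCell.timestamp() >= oldCell.timestamp() ? newCell : oldCell));
        }
    }
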
  • 15. Cassandra: SSTables • Cassandra allows blind writes • Row data can become fragmented over multiple SSTables over time • Bloom filters and indexes can potentially help • Ultimately, multiple fragments need to be read from disk • (example: a row with Employee ID 99231234, First Name Prashanth, Last Name Menon, Age 25, Department ID MSRG, fragmented across several SSTables)
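
The Bloom-filter shortcut mentioned on slide 15 can be sketched as below, with Guava's BloomFilter standing in for Cassandra's per-SSTable filters and SSTableStub being a made-up abstraction. A negative answer lets the read skip that SSTable entirely; a false positive merely costs one wasted I/O.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    // Sketch: each SSTable keeps a Bloom filter over its row keys, so a read only
    // touches SSTables that might contain the key. Guava's BloomFilter is used as
    // a stand-in for Cassandra's own implementation.
    class BloomSkipSketch {
        static class SSTableStub {
            final BloomFilter<CharSequence> filter =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                                   1_000_000,  // expected keys in this SSTable (assumed)
                                   0.01);      // ~1% false-positive rate (assumed)

            void addKey(String key) { filter.put(key); }

            boolean mightContain(String key) { return filter.mightContain(key); }
        }

        // Counts how many SSTables a read for 'key' would actually have to touch.
        static int sstablesToRead(String key, Iterable<SSTableStub> sstables) {
            int reads = 0;
            for (SSTableStub sstable : sstables) {
                if (sstable.mightContain(key)) reads++;  // false positives still cost one I/O
            }
            return reads;
        }
    }
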
  • 16. Cassandra: Row Cache • The row cache buffers the full merged row in memory • A cache miss follows the regular read path, constructs the merged row, and brings it into the cache • Makes the read path faster for frequently accessed data • Problem: the row cache occupies memory • Takes precious memory away from the rest of the system • Idea: extend the row cache efficiently onto SSD • (diagram: row cache in memory in front of the Memtable, commit log, and SSTables)
  • 17. Extended Row Cache • Extend the row cache onto SSD • Chained with the in-memory row cache • LRU in memory, overflow onto an LRU SSD row cache • Implemented as append-only cache files • Efficient sequential writes • Fast random reads • Zero I/O for a hit in the first-level row cache • One random I/O on SSD for the second-level row cache • (diagram: 1st-level row cache in memory, 2nd-level row cache and cache index on SSD, SSTables on disk)
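
A minimal sketch of the two-level row cache on slide 17, under assumptions about layout and naming: an access-ordered LinkedHashMap acts as the in-memory LRU, evicted rows are appended sequentially to an SSD-resident cache file, and an in-memory index of key to (offset, length) makes a second-level hit exactly one random SSD read. This is a simplified illustration, not the implementation from the paper.

    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the extended row cache: LRU in memory, overflow appended to an
    // SSD cache file, plus an in-memory index so a second-level hit costs one
    // random SSD read. Sizes, file layout, and names are illustrative.
    class TwoLevelRowCacheSketch {
        private final RandomAccessFile ssdFile;                        // append-only cache file on SSD
        private final Map<String, long[]> ssdIndex = new HashMap<>();  // key -> {offset, length}
        private final LinkedHashMap<String, byte[]> memory;            // 1st-level LRU cache

        TwoLevelRowCacheSketch(String ssdPath, int memoryCapacity) throws Exception {
            this.ssdFile = new RandomAccessFile(ssdPath, "rw");
            this.memory = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {  // access order = LRU
                @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    if (size() > memoryCapacity) {
                        spillToSsd(eldest.getKey(), eldest.getValue());          // overflow to 2nd level
                        return true;
                    }
                    return false;
                }
            };
        }

        private void spillToSsd(String key, byte[] row) {
            try {
                long offset = ssdFile.length();
                ssdFile.seek(offset);                                  // sequential append
                ssdFile.write(row);
                ssdIndex.put(key, new long[] {offset, row.length});
            } catch (Exception e) { throw new RuntimeException(e); }
        }

        void put(String key, byte[] mergedRow) { memory.put(key, mergedRow); }

        byte[] get(String key) throws Exception {
            byte[] row = memory.get(key);                              // 1st level: zero I/O
            if (row != null) return row;
            long[] location = ssdIndex.get(key);
            if (location == null) return null;                         // miss: fall back to the SSTables
            row = new byte[(int) location[1]];
            ssdFile.seek(location[0]);                                 // 2nd level: one random SSD read
            ssdFile.readFully(row);
            memory.put(key, row);                                      // promote back into memory
            return row;
        }
    }
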
  • 18. Evaluation: SSD Row Cache • Setup: 100M rows, 50GB total data, 6GB row cache • Results: 75% improvement in throughput • 75% improvement in latency • A RAM-only cache has too low a hit ratio • (Fig. 5(a)-(b): throughput and latency at 95%, 85%, and 75% reads with the row cache disabled, RAM only, and RAM+SSD)
  • 19. Dynamic Schema • Key-value stores favor a schema-less data model • Very flexible, good for highly varying data • Schemas often change, and defining them up front can be detrimental • Observation: many big data applications have relatively stable schemas • e.g., clickstream, APM, sensor data, etc. • Redundant schemas have significant overhead in I/O and space usage • (example: in the on-disk format every row repeats the column names Metric Name, Timestamp, Value, Max, Min alongside its values; the application format stores the names only once)
  • 20. Dynamic Schema (III) • Don't serialize the redundant schema with the rows • Extract the schema from the data, store it on SSD, and serialize only a schema ID with the data • Allows for a large number of schemas • (example: a schema catalogue on SSD maps S1 to {Metric Name, Timestamp, Value, Max, Min} and S2 to {Metric Name, Timestamp, All, Warn, Error}; rows in the new on-disk format carry just S1 or S2 plus their values)
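
The schema-extraction step on slide 20 can be sketched as a small catalogue that interns each distinct list of column names once, hands out a compact schema ID, and stores rows as (schema ID, values) instead of repeating the names per row. Persisting the catalogue to SSD is omitted, and the class and record names are illustrative rather than the paper's code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a dynamic schema catalogue: column-name lists are interned once and
    // replaced by small IDs in the serialized rows. Assumes callers supply rows with
    // a deterministic column order (e.g. a LinkedHashMap) so equal schemas match.
    class SchemaCatalogueSketch {
        private final Map<List<String>, Integer> idsByColumns = new HashMap<>();
        private final List<List<String>> columnsById = new ArrayList<>();

        // Returns the ID for this column set, registering it if it is new.
        int internSchema(List<String> columnNames) {
            return idsByColumns.computeIfAbsent(List.copyOf(columnNames), columns -> {
                columnsById.add(columns);
                return columnsById.size() - 1;
            });
        }

        // What would be written to disk: a schema ID plus the values, in schema order.
        record EncodedRow(int schemaId, List<String> values) {}

        EncodedRow encode(Map<String, String> row) {
            int id = internSchema(List.copyOf(row.keySet()));
            List<String> values = new ArrayList<>();
            for (String column : columnsById.get(id)) values.add(row.get(column));
            return new EncodedRow(id, values);
        }

        Map<String, String> decode(EncodedRow encoded) {
            List<String> columns = columnsById.get(encoded.schemaId());
            Map<String, String> row = new HashMap<>();
            for (int i = 0; i < columns.size(); i++) row.put(columns.get(i), encoded.values().get(i));
            return row;
        }
    }
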
  • 21. Evaluation: Dynamic Schema • Setup: 40M rows, variable columns 5-10 (638 schemas), 6GB row cache • Results: 10% reduction in disk usage (6.8GB vs. 6GB) • Slightly improved throughput, stable latency • Effective SSD usage (only random reads) and reduced I/O and space usage • (Fig. 5(c)-(d): throughput and latency at 95%, 50%, and 5% reads for the regular vs. dynamic schema)
  • 22. Conclusions • Storing Cassandra commit logs on SSD doesn't help • Running SSDs at capacity degrades their performance • Using SSDs as a secondary row cache dramatically improves performance • Extracting redundant schemas onto an SSD reduces disk space usage and required I/O
  • 23. Thanks! • Questions? • Contact: Prashanth Menon (prashanth.menon@utoronto.ca)
  • 24. Future Work • What types of tables benefit most from a dynamic schema? • Impact of compaction on read-heavy workloads • How can SSDs be used to improve the performance of compaction? • What is the performance when storing only SSTable indexes on SSD?
