ICDE2010 Nb-GCLOCK

Makoto Yui, Jun Miyazaki, Shunsuke Uemura, and Hayato Yamana. "Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK", In Proc. ICDE, March 2010.

Transcript

  • 1. Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK. Makoto Yui (1), Jun Miyazaki (2), Shunsuke Uemura (3), and Hayato Yamana (4). (1) Research fellow, JSPS (Japan Society for the Promotion of Science) / Visiting postdoc at Waseda University, Japan, and CWI, Netherlands; (2) Nara Institute of Science and Technology; (3) Nara Sangyo University; (4) Waseda University / National Institute of Informatics
  • 2. Outline: Background; Our approach (Non-Blocking Synchronization, Nb-GCLOCK); Experimental Evaluation; Related Work; Conclusion
  • 3. Background – Recent trends in CPU development. The number of CPU cores per chip is doubling in two-year cycles: single-core CPUs (Pentium, Power4), then multi-core CPUs (Core2, Nehalem), and now many-core CPUs (UltraSparc T2, Azul Vega, Larrabee?). The many-core era is coming.
  • 4. Examples of many-core systems: Niagara T2 – 8 cores x 8 SMT = 64 processors; Azul Vega3 – 54 cores x 16 chips = 864 processors.
  • 5. Background – CPU scalability of open-source DBs. Open-source DBs have faced CPU scalability problems. [Ryan Johnson et al., "Shore-MT: A Scalable Storage Manager for the Multicore Era", In Proc. EDBT, 2009.]
  • 6. Chart: microbenchmark on UltraSparc T1 (32 processors) comparing PostgreSQL, MySQL, and BDB.
  • 7. Axes: throughput (normalized) versus number of concurrent threads (1, 4, 8, 12, 16, 24, 32).
  • 8. The gain after 16 threads is less than 5%.
  • 9. You might think: what about TPC-C?
  • 10. CPU scalability of PostgreSQL. TPC-C benchmark result on a high-end Linux machine from Unisys (Xeon SMP, 32 CPUs, 16 GB memory, EMC RAID10 storage). [Doug Tolbert, David Strong, Johney Tsai (Unisys), "Scaling PostgreSQL on SMP Architectures", PGCON 2007.]
  • 11. Chart: TPS versus CPU cores for PostgreSQL versions 8.0, 8.1, and 8.2.
  • 12. The gain after 16 CPU cores is less than 5%.
  • 13. Q. What did the PostgreSQL community do?
  • 14. A. They revised the synchronization mechanisms in the buffer management module.
  • 15. Synchronization in the buffer management module. Several empirical studies have revealed that the largest bottleneck is synchronization in the buffer management module. [1] Ryan Johnson, Ippokratis Pandis, Anastassia Ailamaki: "Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines", In Proc. DaMoN, 2008. [2] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: "OLTP Through the Looking Glass, and What We Found There", In Proc. SIGMOD, 2008.
  • 16. Diagram: the buffer manager sits between the CPU (issuing page requests) and the database files on disk (HDD), and reduces disk accesses by caching database pages in memory.
  • 17. On each page request the buffer manager (1) looks up a hash table; on a hit the cached page is returned, and on a miss (2) a page replacement algorithm selects a frame and the page is read from the database files.
  • 20. Naive buffer management schemes (PostgreSQL 8.0 vs. 8.1). Both look up a hash table (hash buckets) and, on a miss, run an LRU page replacement algorithm over the database files.
  • 21. PostgreSQL 8.0 protects the whole path with one giant lock ("giant lock sucks!").
  • 22. The LRU list always needs to be locked when it is accessed.
  • 23. PostgreSQL 8.1 striped the lock across the hash buckets.
  • 24. Result: 8.0 did not scale at all; 8.1 scales up to 8 processors.
  • 25. Less naive buffer management schemes (PostgreSQL 8.1 vs. 8.2). 8.1 uses LRU, which always needs to be locked when it is accessed, and scales up to 8 processors. 8.2 replaces LRU with CLOCK.
  • 26. CLOCK does not require a lock when an entry is touched, so 8.2 scales up to 16 processors. (A minimal sketch of this lock-free hit path follows.)
  • 27. Outline: Background; Our approach (Non-Blocking Synchronization, Nb-GCLOCK); Experimental Evaluation; Related Work; Conclusion
  • 28. Core idea of our approach: previous approaches versus our optimistic approach. In both, the buffer manager sits between the CPU/memory and the database files on disk and serves page requests.
  • 29. Previous approaches: ○ good at reducing disk I/Os, but × locks are contended.
  • 30. The intuition: with enough processors, the disk bandwidth is not fully utilized.
  • 33. So we reduce the lock granularity to one CPU instruction and remove the bottleneck.
  • 34. Our optimistic approach: △ the number of I/Os slightly increases, but ○ there is no contention on locks.
  • 35. Major difference to previous approaches. Previous approaches: ○ reduce disk I/Os, × locks are contended. Our optimistic approach: △ the number of I/Os slightly increases, ○ no contention on locks.
  • 36. Their goal: improve buffer hit rates to reduce I/Os. This has been the single goal for many decades, but is it still valid in the many-core era? There are also SSDs now.
  • 38. Our goal: improve throughput by utilizing (many) CPUs.
  • 39. The means: use non-blocking synchronization instead of acquiring locks.
  • 40. What's non-blocking and lock-free?
  • 41. Formally: stopping one thread will not prevent global progress; individual threads make progress without waiting.
  • 43. Less formally: no thread locks any resource; no critical sections, locks, mutexes, spin-locks, etc.
  • 44. Lock-free: every successful step makes global progress and completes within finite time (ensuring liveness).
  • 45. Wait-free: every step makes global progress and completes within finite time (ensuring fairness).
  • 46. Non-blocking synchronization: a synchronization method that does not acquire any lock, enabling concurrent access to shared resources. It relies on atomic CPU primitives and memory barriers.
  • 47. The key atomic primitive is CAS (compare-and-swap), e.g. the cmpxchg instruction on x86.
  • 48. Blocking increment of a shared counter: acquire_lock(lock); counter++; release_lock(lock);
  • 49. Non-blocking increment: int old; do { old = *counter; } while (!CAS(counter, old, old + 1)); The counter is incremented only if its value still equals old; otherwise the loop retries. (A runnable Java version follows.)
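For readers who want to run the slide's CAS loop, here is a direct Java counterpart built on java.util.concurrent (a sketch of the same idea, not the paper's code):

    import java.util.concurrent.atomic.AtomicInteger;

    public class NonBlockingCounter {
        private final AtomicInteger counter = new AtomicInteger();

        // Mirrors the slide: read the old value, then CAS old -> old + 1.
        // The CAS succeeds only if no other thread changed the counter in between;
        // on interference the loop simply retries.
        public int increment() {
            int old;
            do {
                old = counter.get();
            } while (!counter.compareAndSet(old, old + 1));
            return old + 1;
        }

        public static void main(String[] args) {
            NonBlockingCounter c = new NonBlockingCounter();
            System.out.println(c.increment()); // prints 1
        }
    }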
  • 50. Making the buffer manager non-blocking. The manager consists of a hash-table lookup (hash buckets) and, on a miss, a page replacement algorithm (GCLOCK) followed by a locked read from the database files (lock; lseek; read; unlock).
  • 51. Step 1: use an existing lock-free hash table for the lookup.
  • 52. Step 2: remove the locks taken on cache misses (fig. 6 in the paper).
  • 54. Step 3: keep the buffer lookup hash table and GCLOCK consistent (the right half of fig. 3): immediately after the page allocation of a buffer frame changes, the reference in the buffer lookup table may still carry a different page identifier.
  • 56. Step 4: avoid locks on I/Os by utilizing pread, CAS, and memory barriers (fig. 5). (A sketch of this idea follows.)
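One way to picture step 4: a positional read needs no shared seek pointer, so the file needs no lock, and the loaded page can be published with a single CAS. The following Java sketch uses FileChannel's positional read as the pread analog; the names and structure are my own, not fig. 5 from the paper:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.concurrent.atomic.AtomicReference;

    final class OptimisticLoader {
        private final FileChannel channel;                  // database file
        OptimisticLoader(FileChannel channel) { this.channel = channel; }

        // Load the page at 'offset' into 'slot' without taking an I/O lock.
        ByteBuffer load(AtomicReference<ByteBuffer> slot, long offset, int pageSize)
                throws IOException {
            ByteBuffer cached = slot.get();
            if (cached != null) return cached;              // another thread already loaded it
            ByteBuffer page = ByteBuffer.allocate(pageSize);
            channel.read(page, offset);                     // pread-style read, no shared seek pointer
            page.flip();
            // Publish with CAS: a losing thread discards its copy and uses the winner's page.
            return slot.compareAndSet(null, page) ? page : slot.get();
        }
    }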
  • 57. State-machine-based reasoning for selecting a replacement victim: the algorithm is constructed from many small steps and organized as a state machine to ensure global progress.
  • 59. State machine (E: entry action): select a frame; if it is pinned or null, try the next entry; otherwise try to decrement its reference count; if the count is still positive (--refcount > 0), move the clock hand and continue; if the count reaches zero (--refcount <= 0), try to evict the frame (E: evict, CAS on the value); then check whether it was evicted and whether it was swapped; on success, fix the frame in the pool.
  • 60. The search starts at "select a frame" whenever a replacement victim is needed.
  • 61. Decrementing the reference count lowers the weight of a buffer page.
  • 63. Moving the clock hand advances to the next candidate; leaving the "fix in pool" state returns the replacement victim.
  • 64. Several threads (say A and B) run this state machine concurrently, so a candidate found by one thread can be intercepted by another; the losing thread simply continues the sweep from the next state, and no thread ever blocks. (A simplified sketch of this loop follows.)
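To make the state machine concrete, here is a heavily simplified Java sketch of the victim search (my own illustration of the general pattern, not the algorithm proven in the paper): sweep the clock with an atomic hand, skip pinned frames, decrement the GCLOCK weight atomically, and claim a frame with a CAS so that only one thread can evict it; a thread that loses a claim just keeps sweeping.

    import java.util.concurrent.atomic.AtomicInteger;

    final class ClockFrame {
        final AtomicInteger weight = new AtomicInteger(0);   // GCLOCK reference weight
        final AtomicInteger pinCount = new AtomicInteger(0); // > 0 while the frame is in use
    }

    final class VictimFinder {
        private final ClockFrame[] frames;
        private final AtomicInteger hand = new AtomicInteger(0); // shared clock hand

        VictimFinder(ClockFrame[] frames) { this.frames = frames; }

        ClockFrame findVictim() {
            while (true) {                                        // lock-free retry loop
                int i = Math.floorMod(hand.getAndIncrement(), frames.length);
                ClockFrame f = frames[i];
                if (f.pinCount.get() > 0) continue;               // pinned: advance the hand
                if (f.weight.getAndDecrement() <= 0               // weight exhausted, and ...
                        && f.pinCount.compareAndSet(0, 1)) {      // ... this thread wins the claim
                    return f;                                     // caller evicts and reuses f
                }
                // Lost the race or weight still positive: continue the sweep without blocking.
            }
        }
    }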
  • 68. Outline: Background; Our approach (Non-Blocking Synchronization, Nb-GCLOCK); Experimental Evaluation; Related Work; Conclusion
  • 69. Experimental settings. Workload: a Zipf 80/20 distribution (a well-known power law) containing 20% sequential scans; the dataset size is 32 GB in total. Machine: UltraSPARC T2 with 64 processors.
  • 70. We also performed evaluations on various x86 settings in the paper. (A rough sketch of the skewed workload follows.)
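As a rough picture of the workload, the sketch below generates an 80/20-skewed page reference string (my own approximation of the description above, not the authors' benchmark driver, and not an exact Zipf law): 80% of requests go to the hottest 20% of pages.

    import java.util.Random;

    final class SkewedWorkload {
        private final Random rnd = new Random(42);
        private final int numPages;                      // assumed large (the dataset is 32 GB of pages)

        SkewedWorkload(int numPages) { this.numPages = numPages; }

        int nextPageId() {
            int hotSet = numPages / 5;                   // hottest 20% of the pages
            return rnd.nextDouble() < 0.8
                    ? rnd.nextInt(hotSet)                        // 80% of accesses hit the hot set
                    : hotSet + rnd.nextInt(numPages - hotSet);   // remaining 20% hit the cold set
        }
    }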
  • 71. Performance comparison under moderate I/O (fig. 9). Chart: throughput normalized by LRU for LRU, GCLOCK, and Nb-GCLOCK at 8, 16, 32, and 64 processors.
  • 72. CPU utilization: the previous approaches stay low (about 20%), while Nb-GCLOCK keeps utilization above 95%.
  • 73. The difference in CPU time should grow as the number of CPUs increases, so we expect an even larger throughput advantage.
  • 74. Maximum throughput versus processors: scalability when all pages are resident in memory, intended to show the scalability limit of each algorithm.
  • 75. Throughput (operations/sec, plotted on a log scale):

        Processors (cores)   8 (1)      16 (2)     32 (4)      64 (8)
        2Q                   890992     819975     866009      662782
        GCLOCK               1758605    1912000    1931268     1817748
        Nb-GCLOCK            3409819    7331722    14245524    25834449

  • 76. Nb-GCLOCK achieves almost linear scalability, at least up to 64 processors. This is the first attempt to remove locks from buffer management.
  • 77. Note that GCLOCK hits a CPU-scalability limit at around 16 processors; caching solutions built on GCLOCK share that limit.
  • 78. Maximum throughput (operations/sec) evaluation. Workload: Zipf 80/20, evaluated on UltraSparc T2 (64 processors). Accesses are issued from 64 threads for 60 seconds, so ideally 64 x 60 = 3,840 seconds of CPU time can be used.
  • 80. Nb-GCLOCK uses most of that CPU time because it is non-blocking.
  • 81. The locking-based schemes use only about 10-20% of the CPU time.
  • 82. The gap in CPU utilization will widen further as the number of processors grows, because locking causes more contention.
  • 83. TPC-C evaluation using Apache Derby. Chart: transactions per minute (tpmC) versus the number of terminals (threads: 8, 16, 32, 64, 128) for stock Derby and for Derby with Nb-GCLOCK. [Sang Kyun Cha et al., "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems", In Proc. VLDB, 2001.]
  • 84. Derby's original scheme (CLOCK) loses throughput as the number of terminals grows, while our scheme shows better results.
  • 85. With the buffer management module no longer the bottleneck, the remaining limit is a latch on the root page of the B+-tree; going further would require a concurrent B+-tree (see OLFIT).
  • 86. Outline: Background; Our approach (Non-Blocking Synchronization, Nb-GCLOCK); Experimental Evaluation; Related Work; Conclusion
  • 87. Related work: BP-Wrapper [Xiaoning Ding, Song Jiang, and Xiaodong Zhang: "BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free", In Proc. ICDE, 2009]. BP-Wrapper eliminates lock contention on buffer hits by using a batching and prefetching technique: page accesses are merely recorded on the hit path.
  • 88. It postpones the physical work (adjusting the buffer replacement list) and returns immediately after the logical operation; this is called lazy synchronization in the literature.
  • 89. Pros: it works with any page replacement algorithm. Cons: it does not increase the throughput of CLOCK variants, because CLOCK does not require locks on buffer hits; and cache misses involve applying the batch, where the longer lock-holding time causes more contention. (A sketch of the batching idea follows.)
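For contrast with Nb-GCLOCK, here is a hedged sketch of the batching idea behind BP-Wrapper (my own simplification, not the authors' code): each thread records accesses in a private batch and takes the replacement-list lock only once per batch instead of on every hit.

    import java.util.ArrayList;
    import java.util.List;

    final class BatchedRecorder {
        private static final int BATCH_SIZE = 64;
        private final ThreadLocal<List<Integer>> batch =
                ThreadLocal.withInitial(ArrayList::new);
        private final Object listLock = new Object();     // protects the replacement list

        void recordAccess(int pageId) {
            List<Integer> local = batch.get();
            local.add(pageId);                            // common path: no lock taken
            if (local.size() >= BATCH_SIZE) {
                synchronized (listLock) {                 // one lock acquisition per batch
                    for (int id : local) {
                        adjustReplacementList(id);        // bookkeeping of the underlying algorithm
                    }
                }
                local.clear();
            }
        }

        private void adjustReplacementList(int pageId) {
            // placeholder: e.g., move-to-front for LRU, weight update for GCLOCK
        }
    }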
  • 90. Conclusions. We proposed a lock-free variant of the GCLOCK page replacement algorithm, named Nb-GCLOCK. Linearizability and lock-freedom are proven in the paper.
  • 91. Nb-GCLOCK shows almost linear scalability up to 64 processors, while existing locking-based schemes do not scale beyond 16 processors. It is the first attempt to introduce non-blocking synchronization into database buffer management, and it performs optimistic I/Os using pread, CAS, and memory barriers.
  • 92. Lock-freedom guarantees a certain throughput: any active thread taking a bounded number of steps ensures global progress.
  • 93. This work is also useful for any caching solution that requires high throughput (e.g., C10K accesses).
  • 94. Thank you for your attention!
