Database Research on Modern Computing Architecture
A brief introduction to database research that harnesses the characteristics of modern hardware.

Presentation Transcript

  • Database Research on Modern Computing Architecture
    September 10, 2010
    Kyong-Ha Lee (bart7449@gmail.com)
    Department of Computer Science, KAIST, Daejeon, Korea
  • Brief Overview of This Talk
     Basic theories and principles of database technology on modern hardware
      • Not much discussion of implementations or tools, but I will be happy to discuss them if there are any questions
     Topics
      • The immense changes in computer architecture
      • A variety of computing resources
      • Intra-node parallelism
      • Database technology that exploits modern hardware features
  • Things we have now in our PC (Intel Core 2 Duo E8400, 3.0GHz "Wolfdale")
      • Two cores, each with 16 integer registers and 16 double-precision FP registers
      • Throughput ~1 instruction per cycle; one cycle takes ~0.33 ns (the exact number of cycles depends on the instruction)
      • Per core: 32KB L1 D-cache and 32KB L1 I-cache, latency 1 ns (3 cycles); L1 TLB with 128 entries for instructions and 256 entries for data
      • Shared 6MB L2 unified cache, latency 4.7 ns (14 cycles); L2 TLB
      • Front Side Bus: 1,333MHz, bandwidth 10GB/s
      • Intel X48 Northbridge chip; PCI Express 2.0 x16, 8GB/s (each way)
      • 4GB of DDR3 RAM, latency ~83 ns (~250 cycles)
  • ...and on the Southbridge side (Intel ICH9R chip)
      • DMI interface: bandwidth 1GB/s (each way)
      • USB 2.0: ~30MB/s; Serial ATA port: 300MB/s; FireWire 800: ~55MB/s; PCIe 2.0 x1: 500MB/s (each way)
      • Seagate 1TB 7,200 RPM HDD with 32MB cache, operating at the SATA rate; sustained disk I/O ~138MB/s; random seek time 8.5ms (read) / 9.5ms (write), i.e., 25.7 million / 28.8 million cycles
      • Wireless 802.11g: ~2.5MB/s; gigabit wired Ethernet: ~100MB/s
      • My LGU+ cable Internet line: 100Mb/s up/down
    original source: http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
  • So what's happening now?
     Changes in the memory hierarchy
      • Higher capacity and the emergence of Non-Volatile RAM (NVRAM)
     Memory wall and multi-level caches
      • Latency lags bandwidth
     Increasing number of cores on a single die
      • Multicore, or CMP
     A variety of computing resources
      • CPU, GPU, NP and FPGA
     Intra-/inter-node parallelism
      • CMP vs. MPP
  • Now in the Memory Hierarchy
     Very cheap HDDs with VERY high capacity
      • Seagate 1TB Barracuda 3.5" HDD (7,200rpm, 32MB cache) for 74,320 won ($61.94) in Aug 2010
      • 1GB for 74.3 won
     Write-once storage
      • Tape drives are dead and optical disc drives (ODD) are waning, due to their poor latency and seek time (seek time >= 100ms)
      • although a 22X DVD writer can sequentially write 4.7GB within 3 minutes (29.7MB/s in theory)
      • 1GB for 53.82 won
     The price of RAM has fallen enough to keep much more data in memory than before
      • A 4GB DDR3 module (1,333MHz) for 108,000 won
      • 1GB for 27,000 won ($22.5)
      • but still, cost_m >> cost_d
  • The Five-Minute Rule
     Cache randomly accessed disk pages that are reused every 5 minutes [1]:
      BreakEvenIntervalInSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) x (PricePerDiskDrive / PricePerMBofRAM)
     In 1987 the break-even interval was ~2 minutes; after that, ~5 minutes in 1997 and ~88 minutes in 2007
     "Memory becomes HDD, HDD becomes Tape, and Tape is dead" - Jim Gray
     Today's memory is ~102,400 times faster than HDD
      • Memory: 83 ns (250 cycles)
      • HDD: 8.5ms (25.7 million cycles)
     With today's prices: (256/116) x (61.94/0.0225) = ~6,076 seconds = ~101 minutes
     => Cache your data in memory whenever possible.
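    As a concrete illustration, here is a minimal C sketch (not from the talk) that plugs the slide's example numbers into the break-even formula; the variable names and the 4KB-page / 116-IOPS assumptions are illustrative only:

        #include <stdio.h>

        /* Five-minute-rule break-even interval (Gray & Putzolu [1]):
         * how soon a page must be re-referenced for caching it in RAM
         * to be cheaper than re-reading it from disk. */
        int main(void) {
            double pages_per_mb_ram = 256.0;    /* 1MB / 4KB pages (assumed)   */
            double accesses_per_sec = 116.0;    /* random IOPS of one disk     */
            double price_per_disk   = 61.94;    /* $ per drive (slide example) */
            double price_per_mb_ram = 0.0225;   /* $ per MB of RAM (example)   */

            double breakeven_s = (pages_per_mb_ram / accesses_per_sec)
                               * (price_per_disk / price_per_mb_ram);

            printf("break-even interval: %.0f s (~%.0f minutes)\n",
                   breakeven_s, breakeven_s / 60.0);   /* ~6,076 s, ~101 min */
            return 0;
        }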
  • Latency lags bandwidth
     From 1983 to 2003, for disks [2]:
      • Capacity increased ~2,500 times (0.03GB -> 73.4GB)
      • Bandwidth improved ~143 times (0.6MB/s -> 86MB/s)
      • Latency improved only ~8.5 times (48.3ms -> 5.7ms)
     Why?
      • Moore's law helps bandwidth more than latency
      • Distance limits latency
      • Bandwidth is generally easier to sell
      • Latency helps bandwidth but not vice versa (e.g., spinning the disk faster)
      • Bandwidth hurts latency (e.g., buffering)
      • OS overhead hurts latency
  • Latency vs. Bandwidth
     Latency can be handled by
      • Hiding (or tolerating) it: out-of-order issue, non-blocking caches, prefetching
      • Reducing it: better caches
     Parallelism sometimes helps to hide latency
      • MLP (Memory-Level Parallelism): multiple outstanding cache misses overlapped
      • But it increases bandwidth demand
     Latency is ultimately limited by physics
     Bandwidth can be handled by "spending" more (HW cost)
      • Wider buses and interfaces, interleaving
     Bandwidth improvement usually increases latency
      • No free lunch
     Hierarchies decrease the bandwidth demand on lower levels
      • They serve as traffic filters: a hit in L1 is filtered from L2
      • If the average bandwidth demand is not met -> queues grow without bound
  • NVRAM Storage: Solid State Drive
     Intel X25-M Mainstream (50nm), 160GB
      • Read/write latency: 85/115 us
      • Random 4KB read/write: 35K/3.3K IOPS
      • Sustained sequential read/write: 250/70MB/s
      • 1GB for 3,619 won in Aug 2010
     SSDs have successfully occupied the position between memory and HDD
      • Best suited for sequential read/write
      • e.g., as a logging device
  • Features of SSD
     No mechanical latency
      • Flash memory is an electronic device with no moving parts
      • Provides uniform random access speed, without seek/rotational latency
     No in-place update
      • No data on a page can be updated in place before erasing the page first
      • An erase unit (or block) is much larger than a page
     Limited lifetime
      • MLC: ~0.1M writes, SLC: ~1M writes
      • Hence wear-leveling
     Asymmetric read & write speed
      • Read speed is typically at least 3x faster than write speed
      • Write (and erase) optimization is critical
     Asymmetric sequential vs. random I/O performance
      • Random 4KB read/write: 35K/3.3K IOPS (i.e., 140MB/s / 13.2MB/s in total)
      • Sustained sequential read/write: 250/70MB/s
     "Disk" abstraction
      • LBA (or LPA) -> (channel#, plane#, ...) or just PBA (or PPA)
      • This mapping changes each time a page write is performed
      • The controller must maintain a mapping table in RAM or Flash
  • Memory Wall
     Latencies
      • The CPU stalls because of the time spent on memory accesses
      • Latency of a memory access: ~250 cycles
      • ~249 instructions are blocked waiting for data from that memory access
     Solution: CPU caching!!
  • Why Caching?
     Processor speeds are projected to increase about 70% per year for many years to come. This trend will widen the speed gap between processors and memory: caches will get larger, but memory speed will not keep pace with processor speeds.
     Low-latency memory that hides memory access latency
      • Static RAM vs. Dynamic RAM
      • 3ns (L1) ~ 14ns (L2) vs. 83ns
     Small capacity, with support for locality
      • Temporal locality: recently referenced items are likely to be referenced again in the near future
        • Capacity limits the number of items that can be kept in the cache at a time (the L1 D-cache in an Intel Core i7 is 32KB)
      • Spatial locality: items with nearby addresses tend to be referenced close together in time
        • Exploited at the granularity of one cache line (e.g., 64B in an Intel Core i7), so 32KB/64B = 512 cache lines
  • An Example Memory Hierarchy (Source: Computer Systems: A Programmer's Perspective, 2003)
  • Memory Mountain in 2000 (32B cache line size; Source: Computer Systems: A Programmer's Perspective, 2003)
  • Memory Mountain in 2010 (Intel Core i7 2.67GHz; 32KB L1 d-cache, 256KB L2 cache, 8MB L3 cache, 64B cache line size; source: http://csapp.cs.cmu.edu/public/perspective.html)
  • CPU Cache Structure (Source: Computer Systems: A Programmer's Perspective, 2003)
  • Addressing Caches (Source: Computer Systems: A Programmer's Perspective, 2003)
  • Types of Cache Misses
     Cold miss (or compulsory miss)
      • The data has never been loaded before
     Capacity miss
      • Caused by the cache's limited capacity
      • Must evict a victim to make space for the replacement block (LRU or LFU)
     Conflict miss
      • Involves cache thrashing
      • Can be alleviated by associative caches (e.g., the 8-way set-associative cache in the Core 2 Duo)
     Coherence miss
      • Caused by keeping data consistent between caches
  • Cache Performance
     Metrics
      • Miss rate: # of misses / # of references (the fraction of memory references that miss during the execution of a program)
      • Hit rate: # of hits / # of references
      • Hit time: the time to deliver a word in the cache to the CPU
      • Miss penalty: any additional time required because of a miss
      • Together: average memory access time = hit time + miss rate x miss penalty
     Impact of:
      • Cache size: reduces capacity misses; increases both hit rate and hit time
      • Cache line size: increases spatial locality but decreases temporal locality
      • Associativity
        • Fully associative: no conflict misses, but eventually a linear scan of cache lines
        • Direct-mapped: conflict misses
  • Writing Cache-Friendly Code
     Maximize both kinds of locality in your program
      • Remove as many pointers as possible: this increases spatial locality, at the cost of more expensive updates
      • Fit the working data into a cache line, and into the capacity of the cache: increases spatial and temporal locality
      • Reuse working data as often as possible once it has been read from memory
     Software prefetching
      • Reduces cold misses
  • Example: Matrix Multiplication
    Assumptions:
      • Row-major order
      • Cache block = 8 doubles
      • Cache size C << n
      • Blocking: three B x B blocks fit into the cache, i.e., 3B^2 < C
    (a cache-blocked version is sketched below)
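    To make the blocking idea concrete, here is a minimal C sketch (not from the talk) of cache-blocked matrix multiplication under the slide's assumptions (row-major doubles, block size chosen so three blocks fit in the cache); the block size BS and all names are illustrative:

        #include <stddef.h>

        #define BS 64   /* block size: pick BS so that 3*BS*BS doubles fit in the cache */

        /* C = C + A * B, all n x n, row-major. Blocking keeps one BS x BS tile
         * of each matrix hot in the cache while it is reused, improving
         * temporal locality over the naive triple loop. */
        void matmul_blocked(size_t n, const double *A, const double *B, double *C)
        {
            for (size_t ii = 0; ii < n; ii += BS)
                for (size_t kk = 0; kk < n; kk += BS)
                    for (size_t jj = 0; jj < n; jj += BS)
                        for (size_t i = ii; i < ii + BS && i < n; i++)
                            for (size_t k = kk; k < kk + BS && k < n; k++) {
                                double a = A[i*n + k];            /* reused across j */
                                for (size_t j = jj; j < jj + BS && j < n; j++)
                                    C[i*n + j] += a * B[k*n + j]; /* stride-1 access */
                            }
        }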
  • SW Prefetching
     Loop unrolling + prefetching: inner product of double a[] and b[] on a 32-bit machine with a 32B cache line (4 doubles per line)
      for (int i = 0; i < N - 4; i += 4) {
          prefetch(&a[i+4]);   // fetch the next cache line of a
          prefetch(&b[i+4]);   // fetch the next cache line of b
          ip = ip + a[i]   * b[i];
          ip = ip + a[i+1] * b[i+1];
          ip = ip + a[i+2] * b[i+2];
          ip = ip + a[i+3] * b[i+3];
      }
     Data linearization
      • e.g., store the nodes of a binary tree (1; 2, 3; 4, 5, 6, 7) in preorder-traversal order 1 2 4 5 3 6 7, so that a traversal touches consecutive memory
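    The prefetch() above is pseudocode; as a hedged sketch, on GCC or Clang it could be expressed with the __builtin_prefetch intrinsic (the one-line-ahead prefetch distance is an illustrative tuning choice, not a rule, and is not from the talk):

        #include <stddef.h>

        /* Inner product with software prefetching, assuming 8-byte doubles and a
         * cache line that holds 4 of them (as on the slide). The prefetch
         * arguments are (address, read/write, temporal locality). */
        double dot_prefetch(const double *a, const double *b, size_t n)
        {
            double ip = 0.0;
            size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                __builtin_prefetch(&a[i + 4], 0, 3);
                __builtin_prefetch(&b[i + 4], 0, 3);
                ip += a[i]   * b[i];
                ip += a[i+1] * b[i+1];
                ip += a[i+2] * b[i+2];
                ip += a[i+3] * b[i+3];
            }
            for (; i < n; i++)          /* handle the remainder */
                ip += a[i] * b[i];
            return ip;
        }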
  • Optimizations in Modern Microprocessors
     Pipelining (Intel i486)
      • Exploits ILP (Instruction-Level Parallelism)
      • Improves throughput, but not latency
     Out-of-order execution (Intel P6)
      • 96-entry instruction window (Core 2), 128-entry instruction window (Nehalem)
      • vs. in-order processors (Intel Atom, GPUs)
     Superscalar (Intel P5)
      • 3-wide (Core 2), 4-wide (Nehalem)
  •  Simultaneous Multi-Threading (from the Intel Pentium 4)
      • TLP (Thread-Level Parallelism): hardware multi-threading
      • Support for HW-level context switching
      • Issues multiple instructions from multiple threads in one cycle
      • HT (Hyper-Threading) is Intel's term for SMT
     SIMD (Single Instruction, Multiple Data) (from the Intel Pentium III)
      • DLP (Data-Level Parallelism)
      • 128-bit SSE (Streaming SIMD Extensions) for the x86 architecture
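    As an illustration of how SIMD reappears later in the database context (reducing branches, as in [7]), here is a hedged C sketch using SSE2 intrinsics; it is not from the talk, and it assumes 16-byte-aligned input whose length is a multiple of 4:

        #include <emmintrin.h>   /* SSE2 */
        #include <stddef.h>
        #include <stdint.h>

        /* Count how many 32-bit keys satisfy key < constant, 4 keys per SSE2
         * instruction and without a data-dependent branch per element. */
        size_t count_less_than(const int32_t *data, size_t n, int32_t constant)
        {
            __m128i c = _mm_set1_epi32(constant);
            size_t count = 0;
            for (size_t i = 0; i < n; i += 4) {
                __m128i v    = _mm_load_si128((const __m128i *)&data[i]);
                __m128i mask = _mm_cmplt_epi32(v, c);   /* all-ones where v < c */
                /* movemask yields one bit per byte, so each matching 32-bit
                 * lane contributes 4 set bits. */
                count += __builtin_popcount(_mm_movemask_epi8(mask)) / 4;
            }
            return count;
        }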
  •  Branch prediction and speculative execution
      • Guess which way a branch will go before this is known for sure
      • Keeps the pipeline full, improving ILP
     Hardware prefetching
      • Hides latency by fetching data from memory in advance
      • Advantages: no instruction overhead to issue prefetches, no SW cost
      • Disadvantages: cache pollution, wasted bandwidth, HW cost and compatibility
  • Speed of a Program
     CPI (Cycles Per Instruction) vs. IPC (Instructions Per Cycle)
     MIPS (Million Instructions Per Second)
      • FLOPS (Floating-point Operations Per Second): GFLOPS, TFLOPS
     Execution time T = N (instruction count) x CPI x T_cycle
     Improvement comes from
      • Reducing the number of instructions
      • Reducing CPI
      • Increasing clock speed
  • Virtuous Cycle, circa 1950-2005
    Increased processor performance -> larger, more feature-full software -> slower programs -> higher-level languages & abstractions, larger development teams -> demand for still more processor performance
    World-Wide Software Market (per IDC): $212B (2005) -> $310B (2010)
  • Virtuous Cycle, circa 2005-??
    The same cycle, but "increased processor performance" is crossed out: GAME OVER - NEXT LEVEL?
    Larger, more feature-full software -> slower programs -> higher-level languages & abstractions, larger development teams -> Thread-Level Parallelism & Multicore Chips
  • CMP (Chip-Level Multiprocessor)
     Apple Inc. started selling a 12-core Mac Pro in Aug 2010:
      • "The new Mac Pro offers two advanced processor options from Intel. The Quad-Core Intel Xeon 'Nehalem' processor is available in a single-processor, quad-core configuration at speeds up to 3.2GHz. For even greater speed and power, choose the 'Westmere' series, Intel's next-generation processor based on its latest 32-nm process technology. 'Westmere' is available in both quad-core and 6-core versions, and the Mac Pro comes with either one or two processors. Which means that you can have a 6-core Mac Pro at 3.33GHz, an 8-core system at 2.4GHz, or, to max out your performance, a 12-core system at up to 2.93GHz." - from the Apple homepage
  • Multicore
     Moore's law is still valid
      • "The # of transistors on an integrated circuit has doubled approximately every other year." - Gordon E. Moore, 1965
     Obstacles to increasing clock speed
      • Power density problem: "Can soon put more transistors on a chip than can afford to turn on" - Patterson '07
      • Heat problem: e.g., Intel Pentium 4 Prescott (3.7GHz) in 2004
     Limits in Instruction-Level Parallelism (ILP)
    => The emergence of multicore!!
    [Diagrams: Intel Core 2 Duo (two cores sharing an L2 cache) and Intel Core i7]
  • Chip density is continuing to increase ~2x every 2 years
    Clock speed is not increasing
    The number of processor cores may double instead
    There is little or no hidden parallelism (ILP) left to be found
    Parallelism must be exposed to and managed by software
    Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
  • "Can soon put more transistors on a chip than can afford to turn on." - Patterson '07
    Scaling clock speed (business as usual) will not work
    [Chart: power density (W/cm^2) vs. year, 1970-2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium and P6, approaching the levels of a hot plate, nuclear reactor, rocket nozzle and the Sun's surface. Source: Patrick Gelsinger, Intel]
  • Parallelism Saves Power
     Exploit explicit parallelism to reduce power
      • Power = C x V^2 x F,  Performance = Cores x F  (C: capacitance, V: voltage, F: frequency)
      • Using additional cores
        - More cores increase density (more transistors = more capacitance)
        - Can increase cores (2x) and performance (2x)
        - Or increase cores (2x) but halve frequency and voltage: Power = (2C) x (V/2)^2 x (F/2) = (C x V^2 x F)/4, i.e., the same performance at 1/4 the power
      • Additional benefit
        - Small/simple cores -> more predictable performance
  • Amdahl's Law
     Two basic metrics
      • Speedup(N) = T(1) / T(N)
      • Efficiency(N) = Speedup(N) / N
     Recall Amdahl's law [1967]
      • Simple SW assumption: a fraction f of the work is parallelizable, the rest is serial
      • No overhead for scheduling, communication, synchronization, etc.
      • Speedup(f, N) = 1 / ((1 - f) + f/N)
      • e.g., f = 0.9, N = 8 -> speedup ~4.7; even as N -> infinity, the speedup is bounded by 1/(1 - f) = 10
  • Types of Multicore
     Symmetric multicore
      • e.g., Core 2 Duo, i5, i7, octo-core Xeon
     Model assumptions (Hill & Marty [3])
      • Each chip is bounded to N BCEs (base core equivalents) for all cores
      • Each core consumes R BCEs
      • Symmetric multicore = all cores identical, therefore N/R cores per chip ((N/R) x R = N)
      • For an N = 16 BCE chip: sixteen 1-BCE cores, or four 4-BCE cores, or one 16-BCE core
  • Performance of Symmetric Multicore Chips
     Serial fraction 1-F uses 1 core at rate Perf(R), so serial time = (1 - F) / Perf(R)
     Parallel fraction F uses N/R cores at rate Perf(R) each, so parallel time = F / (Perf(R) x (N/R)) = F x R / (Perf(R) x N)
     Therefore, w.r.t. one base core:
      Symmetric speedup = 1 / ( (1 - F)/Perf(R) + (F x R)/(Perf(R) x N) )
     Implications? Enhanced cores speed up both the serial and the parallel parts
  • Symmetric Multicore Chip, N = 16 BCEs
    [Chart: symmetric speedup vs. R for F = 0.5, 0.9, 0.975, 0.99, 0.999; R = 1, 2, 4, 8, 16 gives 16, 8, 4, 2, 1 cores. E.g., F = 1, R = 1: 16 cores, speedup 16; F = 0.9, R = 2: 8 cores, speedup 6.7]
    F matters: Amdahl's Law applies to multicore chips, so many researchers should target parallelism (F) first
    As Moore's Law increases N, enhanced core designs are often needed, so some architecture researchers target single-core performance
  •  Asymmetric multicore
      • e.g., the Cell Broadband Engine in the PS3: 1 PPE (Power Processor Element) and 8 SPEs (Synergistic Processor Elements)
     In the model:
      • Each chip is bounded to N BCEs (for all cores)
      • One R-BCE core leaves N - R BCEs, which are used for N - R base cores
      • Therefore 1 + N - R cores per chip
      • For an N = 16 BCE chip: symmetric: four 4-BCE cores; asymmetric: one 4-BCE core & twelve 1-BCE base cores
  • Performance of Asymmetric Multicore Chips
     Serial fraction 1-F is the same, so serial time = (1 - F) / Perf(R)
     Parallel fraction F
      • One core at rate Perf(R), plus N - R base cores at rate 1
      • Parallel time = F / (Perf(R) + N - R)
     Therefore, w.r.t. one base core:
      Asymmetric speedup = 1 / ( (1 - F)/Perf(R) + F/(Perf(R) + N - R) )
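    To see how the two formulas behave, here is a hedged C sketch (not from the talk) that evaluates both speedups; the assumption Perf(R) = sqrt(R) follows the model in [3], and the chosen F and N are illustrative:

        #include <math.h>
        #include <stdio.h>

        /* Hill & Marty's model [3]: a chip has N base-core equivalents (BCEs);
         * an enhanced core built from R BCEs runs at rate perf(R) ~ sqrt(R).
         * F is the parallelizable fraction. */
        static double perf(double r) { return sqrt(r); }

        static double speedup_symmetric(double f, double n, double r) {
            return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
        }

        static double speedup_asymmetric(double f, double n, double r) {
            return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
        }

        int main(void) {
            double n = 16.0, f = 0.9;
            for (double r = 1.0; r <= n; r *= 2.0)
                printf("R=%2.0f  symmetric=%5.2f  asymmetric=%5.2f\n",
                       r, speedup_symmetric(f, n, r), speedup_asymmetric(f, n, r));
            return 0;   /* e.g., R=2 reproduces the slide's speedup of ~6.7 */
        }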
  • Asymmetric Multicore Chip, N = 256 BCEs
    [Chart: asymmetric speedup vs. R for F = 0.5, 0.9, 0.975, 0.99, 0.999; number of cores = 1 (enhanced) + 256 - R (base)]
    How do asymmetric and symmetric speedups compare?
  • Other Laws
     Gustafson's law
      • Scaled speedup = N - α(N - 1), where α = 1 - f is the serial fraction and the problem size grows with N
     Karp-Flatt metric
      • An efficient way to estimate the serial fraction from real measurements
      • e = (1/ψ - 1/N) / (1 - 1/N), where ψ is the measured speedup on N processors
  • Multicore Makes the Memory Wall Problem Worse
     Assume that each core requires 2GB/s of bandwidth to access memory
     What if 6 cores access memory at the same time? => 12GB/s >> FSB bandwidth
     A prefetching scheme that is appropriate for a uniprocessor may be entirely inappropriate for a multiprocessor [22]
    [Chart: total CPU cycles of a query on 1 core vs. 8 cores, broken down into memory, DTLB misses, L2 hits, branch mispredictions and computation]
    See also: http://spectrum.ieee.org/computing/hardware/multicore-is-bad-news-for-supercomputers
    Solution: sharing memory accesses
  • GPU
     DLP (Data-Level Parallelism)
     The GPU has become a powerful computing engine behind scientific computing and data-intensive applications
     Many lightweight, in-order cores
     Has its own separate caches and memory
     GPGPU applications are data-intensive, with long-running kernel executions (10-1,000s of ms) and large data units (1-100s of MB)
  • GPU Architecture (NVIDIA GTX 512)
     16 Streaming Multiprocessors (SMs), each consisting of 32 Stream Processors (SPs), for 512 cores in total
     All threads running on the SPs share the same program, called a kernel
     An SM works as an independent SIMT processor
  • Levels of Parallel Granularity and Memory Sharing
     A thread block is a batch of threads that can cooperate with each other by:
      • Synchronizing their execution, for hazard-free shared-memory accesses
      • Efficiently sharing data through a low-latency shared memory
     Two threads from two different blocks cannot cooperate
  • Four Execution Steps
     1. The DMA controller transfers data from host (CPU) memory to device (GPU) memory
     2. A host program instructs the GPU to launch the kernel
     3. The GPU executes threads in parallel
     4. The DMA controller transfers the results from device memory back to host memory
     Warp: the basic execution (or scheduling) unit of an SM; a group of 32 threads sharing the same instruction pointer, so all threads in a warp take the same code path
  • Comparison with CPU
     CPU
      • Maximizes ILP to accelerate a small number of threads
      • Devotes most of its die area to large caches and sophisticated control logic for advanced features (superscalar, OoO execution, branch prediction, speculative loads)
      • Latency hiding is limited by CPU resources
      • Limited memory bandwidth (32GB/s for a Xeon X5550)
     GPU
      • Maximizes thread-level parallelism
      • Devotes most of its die area to a large array of ALUs
      • Memory stalls can be effectively hidden given a sufficient number of threads
      • Large memory bandwidth (177.4GB/s for a GTX480)
  • GPU Programming Considerations
     What to offload
      • Computation- and memory-intensive algorithms with high regularity are well suited to GPU acceleration
     How to parallelize
     Data structure usage
      • Simple data structures such as arrays are recommended
     Divergence in GPU code
      • SIMT demands minimal code-path divergence caused by data-dependent conditional branches within a warp
     Expensive host-device memory transfer cost
  • FPGA (Field-Programmable Gate Array)
     Von Neumann architecture vs. hardware architecture
     An integrated circuit designed to be configured by the customer
     The configuration is specified using an HDL (Hardware Description Language)
  • Limitations of FPGA
     Area/speed tradeoff
      • Finite number of CLBs on a single die
      • Becomes slower and more power-hungry as the logic becomes more complex
     Acts as hard-wired logic once it is "cooked" (configured)
     No support for recursive calls
     Asynchronous design
     Less power-efficient
  • Is DB Execution Cache-Friendly?
     DB execution time breakdown (in 2005)
     At least 50% of cycles are spent on stalls
     Memory access is the major bottleneck
     Branch mispredictions increase cache misses
  • Modern DB Techniques
     Cache-conscious techniques
      • Data cache: cache-friendly data placement, cache-conscious data structures, buffering index structures, hiding latency using prefetching, cache-conscious joins
      • Instruction cache: buffering, staged database execution
      • Branch prediction: reducing branches, SIMD
     CMP and multithreading: memory scan sharing, staged DB execution
     GPGPU: SIMT
     FPGA: Von Neumann vs. hardware circuit
  • Record Layout Schemes
    Example query: SELECT name FROM R WHERE age > 50
     (a) NSM (N-ary Storage Model)
     (b) DSM (Decomposed Storage Model), i.e., column-based
     (c) PAX (Partition Attributes Across)
     PAX optimizes cache-to-memory communication but retains NSM's I/O behavior (page contents do not change)
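    As an illustration of why the layout matters for the query above, here is a minimal C sketch (not from the talk) contrasting a row-oriented, NSM-style layout with a columnar, DSM/PAX-style layout; the struct and field names are illustrative:

        #include <stddef.h>

        /* NSM-style: whole records stored contiguously. Scanning only `age`
         * still drags `name` (and everything else) through the cache. */
        struct person_row { char name[24]; int age; /* other attributes... */ };

        size_t scan_nsm(const struct person_row *r, size_t n) {
            size_t hits = 0;
            for (size_t i = 0; i < n; i++)
                if (r[i].age > 50) hits++;   /* ~28 bytes fetched per useful int */
            return hits;
        }

        /* DSM/PAX-style: each attribute stored in its own contiguous array,
         * so the age scan touches only age values -> far fewer cache lines. */
        struct person_columns { char (*name)[24]; int *age; size_t n; };

        size_t scan_dsm(const struct person_columns *c) {
            size_t hits = 0;
            for (size_t i = 0; i < c->n; i++)
                if (c->age[i] > 50) hits++;  /* 16 ages per 64B cache line */
            return hits;
        }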
  • Main-Memory Tree Indexes
     T-tree: a balanced binary tree proposed in 1986 for main-memory DBMSs
      • Aim: balance space overhead against search time
     Main-memory B+-trees: better cache performance [4]
      • Node width = cache line size (32-128B)
      • Minimizes the number of cache misses per node visit
      • But the tree is much taller than a traditional disk-based B+-tree => more cache misses per lookup
     How can we make the B+-tree shallower?
  • Cache-Sensitive B+-tree (CSB+-tree)
     Lay out the children of a node contiguously
     Eliminate all but one child pointer
      • The keys in one node fit in one cache line
      • Removing pointers increases the fanout of the tree, which reduces the tree height
      • 35% faster tree lookups
      • Update performance is 30% worse (because of splits)
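    A hedged C sketch (not from the talk) of that node-layout idea: because all children of a node are stored contiguously, one offset to the first child replaces the per-key child pointers, leaving almost the whole 64B cache line for keys; the field sizes and names are illustrative:

        #include <stdint.h>

        /* One internal node fills a single 64-byte cache line. Since a node's
         * children form a contiguous group, child i is found by arithmetic on
         * first_child instead of storing one pointer per key, which roughly
         * doubles the key fanout. */
        struct csb_node {
            uint16_t nkeys;           /* number of valid keys             */
            uint16_t pad;
            uint32_t first_child;     /* index of the first child node    */
            int32_t  keys[14];        /* 14 keys fill the remaining 56B   */
        };

        /* Descend one level: find the first key greater than the probe and
         * jump to the corresponding child within the contiguous group. */
        static uint32_t descend(const struct csb_node *nodes, uint32_t n, int32_t key)
        {
            const struct csb_node *node = &nodes[n];
            uint16_t i = 0;
            while (i < node->nkeys && key >= node->keys[i])
                i++;
            return node->first_child + i;   /* children are adjacent in memory */
        }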
  • Buffering Index Structures
     Buffer accesses to the index structure to avoid cache thrashing
     Nodes of the index tree are grouped together into pieces that fit within the cache
     Increases temporal locality, but accesses can be delayed
  • Prefetching B+-tree (pB+-tree)
     Idea: larger nodes + prefetching
     Node size = multiple cache lines (e.g., 8 lines)
     Prefetch all lines of a node before searching it
     The cost of accessing a node only increases slightly
     A much shallower tree, with no other changes required
     Improves both search and update performance
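    Here is a small hedged C sketch (not from the talk) of that idea: the node spans several cache lines, and all of its lines are prefetched up front so their fetches overlap; sizes and names are illustrative:

        #include <stdint.h>

        #define LINES_PER_NODE 8
        #define KEYS_PER_NODE  120        /* illustrative: fills ~8 x 64B lines */

        struct pb_node {
            int32_t  keys[KEYS_PER_NODE];
            uint32_t nkeys;
        };

        /* Prefetch every cache line of the node, then search it. The memory
         * system overlaps the line fetches, so a wide node costs only slightly
         * more than a one-line node but makes the tree much shallower. */
        static int search_node(const struct pb_node *node, int32_t key)
        {
            const char *p = (const char *)node;
            for (int l = 0; l < LINES_PER_NODE; l++)
                __builtin_prefetch(p + 64 * l, 0, 3);

            int lo = 0, hi = (int)node->nkeys;
            while (lo < hi) {             /* lower-bound binary search */
                int mid = (lo + hi) / 2;
                if (node->keys[mid] < key) lo = mid + 1; else hi = mid;
            }
            return lo;                    /* index of the child to follow */
        }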
  • Fractal pB+-tree
     For faster range scans
      • Leaf-parent nodes contain the addresses of all their leaves
      • Leaf-parent nodes are linked together
      • This structure is used for prefetching leaf nodes
     * A prefetching scheme that is appropriate for a uniprocessor may be entirely inappropriate for a multiprocessor [22].
  • Cache-Conscious Hash Join
     For good temporal locality, the two relations to be joined are partitioned into partitions that fit in the data cache
     To reduce the TLB misses caused by a large partitioning fan-out, use radix clustering
      • Within a cluster, the number of random accesses is low
      • A large number of clusters can be created by making multiple passes through the data
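    A minimal C sketch (not from the talk) of one radix-partitioning pass: tuples are scattered into 2^B partitions on B bits of the key; keeping B small per pass keeps the number of concurrently written output regions within TLB reach, and several passes multiply the fan-out. All names are illustrative:

        #include <stdint.h>
        #include <stdlib.h>

        struct tuple { uint32_t key; uint32_t payload; };

        /* One pass of radix partitioning on B bits of the key, starting at bit
         * `shift`. out has room for n tuples; hist/offsets have 1 << B entries.
         * Two scans: build a histogram, then scatter. */
        void radix_partition(const struct tuple *in, struct tuple *out, size_t n,
                             unsigned shift, unsigned B,
                             size_t *hist, size_t *offsets)
        {
            size_t fanout = (size_t)1 << B, mask = fanout - 1;

            for (size_t p = 0; p < fanout; p++) hist[p] = 0;
            for (size_t i = 0; i < n; i++)           /* pass 1: histogram */
                hist[(in[i].key >> shift) & mask]++;

            size_t off = 0;
            for (size_t p = 0; p < fanout; p++) {    /* prefix sum -> start offsets */
                offsets[p] = off;
                off += hist[p];
            }
            for (size_t i = 0; i < n; i++) {         /* pass 2: scatter */
                size_t p = (in[i].key >> shift) & mask;
                out[offsets[p]++] = in[i];
            }
        }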
  • Group Prefetching [6]
    [Figures: processing of one group of probe tuples]
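    Since the figures did not survive, here is a hedged C sketch (not from the talk) of the group-prefetching idea from [6] applied to a hash-join probe: bucket addresses for a whole group of probe tuples are computed and prefetched first, and only then visited, so the group's cache misses overlap; chained buckets and further stages are omitted, and all names are illustrative:

        #include <stdint.h>
        #include <stddef.h>

        #define GROUP 16   /* tuples processed together so their misses overlap */

        struct bucket { uint32_t key; uint32_t payload; struct bucket *next; };
        struct hashtable { struct bucket **slots; uint32_t mask; };

        size_t probe_group(const struct hashtable *ht,
                           const uint32_t *keys, size_t n, uint32_t *out)
        {
            size_t matches = 0;
            for (size_t g = 0; g < n; g += GROUP) {
                size_t end = (g + GROUP < n) ? g + GROUP : n;
                struct bucket *b[GROUP];

                for (size_t i = g; i < end; i++) {       /* stage 1: prefetch */
                    b[i - g] = ht->slots[keys[i] & ht->mask];
                    __builtin_prefetch(b[i - g], 0, 3);
                }
                for (size_t i = g; i < end; i++) {       /* stage 2: visit */
                    struct bucket *cur = b[i - g];
                    if (cur && cur->key == keys[i])
                        out[matches++] = cur->payload;
                }
            }
            return matches;
        }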
  • Buffering Tuples Between Operators
     Group consecutive operators into execution groups whose code fits into the L1 I-cache
     Buffer the output of each execution group
     I-cache misses are amortized over multiple tuples, and I-cache thrashing is avoided
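    A hedged C sketch (not from the talk) of the buffering idea with a simple iterator interface: the buffered operator drains its child into a batch of tuples in a tight loop, so the child's instruction footprint stays resident in the L1 I-cache for the whole batch; the interface and sizes are illustrative:

        #include <stddef.h>
        #include <stdbool.h>

        #define BUF_TUPLES 256   /* batch size, chosen so code + batch stay cached */

        struct tuple { int fields[4]; };

        struct op {                            /* illustrative iterator interface */
            bool (*next)(struct op *self, struct tuple *out);
            void *state;
        };

        struct buffered_op {
            struct op base;                    /* base.next = buffered_next        */
            struct op *child;
            struct tuple buf[BUF_TUPLES];
            size_t filled, pos;
        };

        static bool buffered_next(struct op *self, struct tuple *out)
        {
            struct buffered_op *b = (struct buffered_op *)self;
            if (b->pos == b->filled) {         /* refill: run the child in a tight loop */
                b->filled = 0;
                while (b->filled < BUF_TUPLES &&
                       b->child->next(b->child, &b->buf[b->filled]))
                    b->filled++;
                b->pos = 0;
                if (b->filled == 0) return false;   /* child exhausted */
            }
            *out = b->buf[b->pos++];
            return true;
        }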
  • How SMT Can Help DB Performance
     Bi-threaded: partition the input between cooperative threads
     Work-ahead set: a main thread plus a helper thread
      • The main thread posts a "work-ahead set" to a queue
      • The helper thread issues load instructions for the posted requests
  • Staged Database Execution Model
     A transaction (TX) may be divided into stages whose code fits in the L1 I-cache
     When one TX reaches the end of a stage, the system switches context to a different thread that needs to execute the same stage
    [Figure: two transactions' instruction streams (LOAD X, STORE Y, LOAD Y, STORE Z, LOAD Z, ...) divided into stages S0, S1 and S2]
  • Stage Spawning
    [Figure: stages S0, S1 and S2 are assigned to cores 0, 1 and 2; each core has a work queue feeding instances of its stage]
  • Main-Memory Scan Sharing
     Sharing memory scans also increases temporal locality
     But too much sharing can cause cache thrashing
  • Summary
     Latency is a major problem
     Cache-friendly programming is indispensable
     Chip-level multiprocessors require software to expose thread-level parallelism (TLP)
     Exploiting diverse computing resources is a challenge
  • Further readings
    1. Jim Gray and Gianfranco R. Putzolu, The 5 Minute Rule for Trading Memory for Disk Accesses and The 10 Byte Rule for Trading Memory for CPU Time, SIGMOD 1987, pp. 395-398
    2. David A. Patterson, Latency Lags Bandwidth, CACM, Vol. 47, No. 10, pp. 71-75, 2004
    3. Mark Hill et al., Amdahl's Law in the Multicore Era, IEEE Computer, Vol. 41, No. 7, pp. 33-38, 2008
    4. J. Rao et al., Cache Conscious Indexing for Decision-Support in Main Memory
    5. P. A. Boncz et al., Breaking the Memory Wall in MonetDB, CACM, Dec 2008
    6. Shimin Chen et al., Improving Hash Join Performance through Prefetching, ICDE 2004
    7. Jingren Zhou et al., Implementing Database Operations Using SIMD Instructions, SIGMOD 2002
    8. J. Cieslewicz and K.A. Ross, Database Optimizations for Modern Hardware, Proceedings of the IEEE 96(5), 2009
    9. Lawrence Spracklen et al., Chip Multithreading: Opportunities and Challenges
    10. Nikos Hardavellas et al., Database Servers on Chip Multiprocessors: Limitations and Opportunities, CIDR 2007
    11. Lin Qiao et al., Main-Memory Scan Sharing for Multi-Core CPUs, PVLDB 2008
    12. Ryan Johnson et al., To Share or Not to Share?, VLDB 2007
    13. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 2009
    14. Database Architectures for New Hardware, tutorial, the 30th VLDB 2004 and the 21st ICDE 2005
    15. Query Co-processing on Commodity Processors, tutorial, the 22nd ICDE 2006
    16. John Nickolls et al., The GPU Computing Era, IEEE Micro, March/April 2010
    17. Kayvon Fatahalian et al., A Closer Look at GPUs, CACM, Vol. 51, No. 10, 2008
    18. John Nickolls et al., Scalable Parallel Programming, ACM Queue, March/April 2008
    19. N.K. Govindaraju et al., GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management, SIGMOD 2006
    20. A. Mitra et al., Boosting XML Filtering with a Scalable FPGA-based Architecture, CIDR 2009
    21. S. Harizopoulos and A. Ailamaki, Improving Instruction Cache Performance in OLTP, ACM TODS, Vol. 31, pp. 887-920
    22. T. Mowry and A. Gupta, Tolerating Latency through Software-Controlled Prefetching in Shared-Memory Multiprocessors, Journal of Parallel and Distributed Computing, 12(2):87-106, 1991

    * Courses available on the Internet
     Introduction to Computer Systems @CMU, 2000-2010
      • http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f10/www/index.html
     Multicore Programming Primer @MIT, 2007 (with video)
      • http://groups.csail.mit.edu/cag/ps3/index.shtml
     Introduction to Multiprocessor Synchronization @Brown
      • http://www.cs.brown.edu/courses/cs176
     Parallel Programming for Multicore @Berkeley, Spring 2007
      • http://www.cs.berkeley.edu/~yelick/cs194f07/
     Applications of Parallel Computing @Berkeley, Spring 2007
      • http://www.cs.berkeley.edu/~yelick/cs267_sp07/
     High-Performance Computing for Applications in Engineering @Wisc, Autumn 2008
      • http://sbel.wisc.edu/Courses/ME964/2008/index.htm
     High Performance Computing Training @Lawrence Livermore National Laboratory
      • https://computing.llnl.gov/?set=training&page=index
     Programming Massively Parallel Processors with CUDA @Stanford, Spring 2010 (with video)
      • on iTunes U and YouTube.com