Database Research on Modern Computing Architecture
September 10, 2010

Kyong-Ha Lee (bart7449@gmail.com)
Department of Computer Science
KAIST, Daejeon, Korea
Brief Overview of This Talk
   Basic theories and principles about database
    technology on modern HW
     • Not much discussion on implementation or tools,
      but will be happy to discuss them if there are any
      questions
   Topics
     • The immense changes in computer architecture
     • A variety of computing resources
     • Intra-node parallelism
     • The DB technology that exploits modern HW
       features


Things we have now in our PC
   Intel Core 2 Duo E8400 (Wolfdale), 3.0GHz, two cores
    • Throughput ~1 instruction per cycle; one cycle takes ~0.33 ns
      (the exact # of cycles depends on the instruction)
    • Per core: 32KB L1 D-cache and 32KB L1 I-cache, latency 1ns (3 cycles);
      L1 TLB with 128 entries for instructions and 256 entries for data
    • Shared 6MB L2 unified cache, latency 4.7ns (14 cycles); L2 TLB
    • Front Side Bus: 1,333MHz, bandwidth 10GB/s
   Intel X48 Northbridge chip
    • PCI Express 2.0 x16: 8GB/s (each way)
    • 4GB of DDR3 RAM modules, latency ~83ns (~250 cycles)
   Intel ICH9R Southbridge chip
    • DMI interface to the Northbridge: bandwidth 1GB/s (each way)
    • USB 2.0: ~30MB/s
    • FireWire 800: ~55MB/s
    • PCIe 2.0 x1: 500MB/s (each way)
    • Wireless 802.11g: ~2.5MB/s
    • Gigabit wired Ethernet: ~100MB/s
    • Serial ATA port: 300MB/s
       • Seagate 1TB 7,200 RPM HDD with 32MB cache; the cache operates at SATA rate
       • Sustained disk I/O: ~138MB/s
       • Random seek time: 8.5ms (read) / 9.5ms (write)
         = 25.7 million / 28.8 million cycles
   My LGU+ cable line to the Internet: 100Mb/s up/down

   original source: http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
So what‘s happening now?
   Changes in memory hierarchy
    • Higher capacity and the emergence of Non-
      Volatile RAM(NVRAM)
   Memory wall and multi-level caches
    • Latency lags bandwidth
 Increasing number of cores in a single die
   • Multicore or CMP
 A variety of computing resources
   • CPU, GPU, NP(network processor) and FPGA
 Intra-/Inter-node parallelism
   • CMP vs. MPP
Now In Memory Hierarchy
   Very cheap HDD with VERY high capacity
      • Seagate 1TB Barracuda 3.5" HDD (7200rpm, 32MB cache) for
         74,320 won ($61.94) in Aug 2010
        • 1GB for 74.3 won
   Write-once storage
      • Tape drives are dead, ODD (optical disc drive) is waning
         • due to poor latency and seek time (seek time >= 100ms)
      • although a 22X DVD writer can sequentially write 4.7GBytes
         within 3 minutes (29.7MB/s in theory)
        1GB for 53.82 won
   Price of RAM has fallen enough to keep much more data in
    memory than before.
     • A 4GB DDR3 Memory(1,333MHz) for 108,000 won
         • 1 GB for 27,000 won ($22.5)
     • but, still cost_m >> cost_d


The Five-Minute Rule
   Cache randomly accessed disk pages that are reused
    every 5 minutes[1].
     •    BreakEvenIntervalInSeconds =
            (PagesPerMBofRAM / AccessesPerSecondPerDisk) x (PricePerDiskDrive / PricePerMBofRAM)
   In 1987, breakeven interval was ~2 minutes
   After that, ~5 minutes in 1997, ~88 minutes in 2007.
   “Memory becomes HDD, HDD becomes Tape,
    and Tape is dead”, by Jim Gray
   Today‘s memory is ~102,400 times faster than HDD
    • Memory : 83 ns(250 cycles)
    • HDD : 8.5ms (25.7 million cycles)
       (256/116) x (61.94/0.0225) = ~101 minutes.
       => Cache your data in memory as always as possible.
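   A minimal sketch of this calculation in C, plugging in the 2010 prices quoted
   above (4KB pages assumed, so 256 pages per MB of RAM):

    /* five-minute-rule break-even interval with the prices quoted on this slide */
    #include <stdio.h>

    int main(void) {
        double pages_per_mb_ram      = 256.0;    /* 1MB / 4KB pages            */
        double accesses_per_sec_disk = 116.0;    /* ~1 / 8.6ms random access   */
        double price_per_disk        = 61.94;    /* $ for the 1TB HDD above    */
        double price_per_mb_ram      = 0.0225;   /* $22.5 per GB of DDR3       */

        double interval_sec = (pages_per_mb_ram / accesses_per_sec_disk) *
                              (price_per_disk / price_per_mb_ram);
        printf("break-even interval: %.0f s (~%.0f minutes)\n",
               interval_sec, interval_sec / 60.0);   /* ~6,080 s, ~101 minutes */
        return 0;
    }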
Latency lags bandwidth
   From 1983 to 2003[2]
    • Capacity increased ~ 2,500 times
        (0.03GB -> 73.4GB)
    •   Bandwidth improved 143.3 times
        (0.6 MB/s -> 86 MB/s)
    •   Latency improved 8.5 times (48.3 ->
        5.7 ms)
   Why?
    • Moore‘s law helps bandwidth more
        than latency
    •   Distance limits latency
    •   Bandwidth is generally easier to sell
    •   Latency helps bandwidth but not
        vice versa.(e.g., spinning disk faster)
    •   Bandwidth hurts latency(e.g., buffer)
    •   OS overhead hurts latency



Latency vs. Bandwidth
   Latency can be handled by
     • Hiding (or tolerating) it – out of order issue, non blocking cache,
         prefetching
     •   Reducing it –better cache
   Parallelism sometimes helps to hide latency
     • MLP(Memory Level Parallelism) - multiple outstanding cache misses
         overlapped
     •   But increased bandwidth demand
   Latency ultimately limited by physics
   Bandwidth can be handled by "spending" more (HW cost)
     • Wider buses, interfaces, interleaving
   Bandwidth improvement usually increases latency
     • No free lunch
   Hierarchies decrease bandwidth demand to lower levels.
     • Serve as traffic filters: a hit in L1 is filtered from L2
     • If average bandwidth is not met -> infinite queues

NVRAM Storage: Solid State Disk

 Intel X25-M Mainstream(50nm) 160GB
   • Read/write latency 85/115 us
   • Random 4KB read/write: 35K/3.3K IOPS
   • Sustained sequential read/write: 250/70MB/s
   • 1GB for 3,619 won in Aug 2010
 SSD has successfully occupied the position
  between memory and HDD
   • best suited for sequential read/write
      • e.g., logging device

Features of SSD
   No mechanical latency
     • Flash memory is an electronic device with no moving parts
     • Provides uniform random access speed without seek/rotational latency
   No in-place update
     • A flash page cannot be overwritten in place; its erase unit must be erased first
     • An erase unit (or block) is much larger than a page
   Limited lifetime
     • MLC: ~0.1M program/erase cycles, SLC: ~1M cycles
     • Wear-leveling spreads writes across blocks
   Asymmetric read & write speed
     • Read speed is typically at least 3X faster than write speed
     • Write (and erase) optimization is critical
   Asymmetric sequential vs. random I/O performance
     • Random 4KB read/write: 35K/3.3K IOPS (i.e., ~140MB/s and ~13.2MB/s)
     • Sustained sequential read/write: 250/70MB/s
   "Disk" abstraction
     • LBA(or LPA) -> (channel#, plane#, …) or just PBA(or PPA)
     • This mapping changes each time a page write is performed
     • The controller must maintain a mapping table in RAM or Flash
Memory wall




   Latencies
    • CPU stalls because of time spent for memory access
       • latency for memory access: 250 cycles
       • ~249 instructions are blocked waiting for data from
         memory
   Solution: CPU Caching!!

Why Caching?
   Processor speeds are projected to increase about 70% per year
    for many years to come. This trend will widen the speed gap
    between memory and processors. Caches will get larger,
    but memory speed will not keep pace with processor speeds.
   Low-latency memory that hides memory access latency
    • Static RAM vs. Dynamic RAM
    • 3ns(L1)~14ns(L2) vs. 83ns
   Small capacity with support of locality
    • Temporal locality
        • Recently referenced items are likely to be referenced in the near
            future
        •   Capacity limits the # of items to be kept in the cache at a time
        •   L1$ in Intel Core i7 is 32KB
     • Spatial locality
         • Items with nearby addresses tend to be referenced close together in time
         • Exploited within the size of one cache line
         • e.g., the cache line size in Intel Core i7 is 64B
         • So 32KB/64B = 512 cache lines in L1
An Example Memory Hierarchy




                                       Source: Computer Systems,
                                       A Programmer‘s Perspective, 2003
Memory Mountain in 2000


                                             32B cache line size




                                     Source: Computer Systems,
                                     A Programmer‘s Perspective, 2003
Memory Mountain in 2010
                                             Intel Core i7
                                             2.67GHz
                                             32KB L1 d-cache
                                             256KB L2 cache
                                             8MB L3 cache
                                             64B cache line size




      source: http://csapp.cs.cmu.edu/public/perspective.html
CPU Cache Structure




                                   Source: Computer Systems,
                                   A Programmer‘s Perspective, 2003
Addressing Caches




                                  Source: Computer Systems,
                                  A Programmer‘s Perspective, 2003
Types of Cache Misses
   Cold miss (or compulsory miss)
    • The data has never been loaded into the cache before (first reference).
   Capacity miss
    • Because of limited capacity
    • must evict a victim to make space for replacement block
       • LRU or LFU
   Conflict miss
    • involves cache thrashing
    • can be alleviated by associative cache
       • e.g., 8-way set associative cache in Core2 Duo
   Coherence miss
    • Data consistency between caches


Cache Performance
   Metrics
    • Miss rate: # of misses / # of references
        • The fraction of memory references that miss during the execution of
          a program
    • Hit rate: # of hits / # of references
    • Hit time: the time to deliver a word in the cache to the CPU
    • Miss penalty: any additional time required because of a miss
   Impact of:
    • Cache size: reduces capacity misses, increases both hit rate
      and hit time
    • Cache line size: a larger line exploits more spatial locality but
      decreases temporal locality (fewer lines)
    • Associativity
        • Fully-associative: no conflict misses, but eventually a linear scan of
          cache lines
        • Direct-mapped: conflict misses



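   These metrics combine into the standard average memory access time
   (AMAT = hit time + miss rate x miss penalty). A minimal sketch below reuses
   the latencies quoted earlier in this talk; the 5% miss rate is an assumed value:

    /* average memory access time from the metrics above (illustrative numbers) */
    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* ns, L1 hit               */
        double miss_penalty = 83.0;   /* ns, trip to main memory  */
        double miss_rate    = 0.05;   /* 5% misses (assumed)      */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.2f ns\n", amat);   /* 1 + 0.05*83 = 5.15 ns */
        return 0;
    }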
Writing Cache-Friendly Codes
   Maximize the two localities in your program
    • Remove pointers as much as possible
        • Increases spatial locality, but also raises update cost
    • Fit the working data into a cache line and into the
      capacity of the cache
        • Increases spatial and temporal locality
    • Use working data as often as possible once it has
      been read from memory
   Software prefetching
     • Reduces cold misses

Example: Matrix Multiplication




                                     *Assumptions:
                                     • Row-major order
                                     • Cache block = 8 doubles
                                     • Cache size C << n
                                     • Three blocks fit into cache: 3B² < C
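   A hedged sketch of the cache-blocked version under the assumptions above;
   the function name and block size B are illustrative, not from the talk:

    /* cache-blocked matrix multiply: three B x B blocks stay cache-resident */
    #define B 32   /* block size in elements; an assumption, tune per cache   */

    void matmul_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i += B)
            for (int j = 0; j < n; j += B)
                for (int k = 0; k < n; k += B)
                    /* multiply one B x B block of a and b into c */
                    for (int ii = i; ii < i + B && ii < n; ii++)
                        for (int jj = j; jj < j + B && jj < n; jj++) {
                            double sum = c[ii * n + jj];
                            for (int kk = k; kk < k + B && kk < n; kk++)
                                sum += a[ii * n + kk] * b[kk * n + jj];
                            c[ii * n + jj] = sum;
                        }
    }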
SW Prefetching
   Loop unrolling + prefetching
    • // inner product of double a[] and b[]; one 32B cache line holds 4 doubles
      // prefetch() stands for a prefetch instruction (e.g., GCC's __builtin_prefetch)
      double ip = 0.0;
      for (int i = 0; i + 4 <= N; i += 4) {
          prefetch(&a[i+4]);   // fetch the next cache line of a[] ahead of use
          prefetch(&b[i+4]);   // and the next line of b[]
          ip += a[i]   * b[i];
          ip += a[i+1] * b[i+1];
          ip += a[i+2] * b[i+2];
          ip += a[i+3] * b[i+3];
      }

   Data linearization
    • e.g., store tree nodes in preorder-traversal order so a traversal reads
      memory sequentially: the tree 1 / (2, 3) / (4, 5, 6, 7) is laid out as
      1 2 4 5 3 6 7
Optimizations in Modern
              Microprocessor
   Pipelining (Intel i486)
    • utilizes ILP(Instruction Level
        Parallelism)
    •   increases throughput but not
        latency.
   Out-of-order execution(Intel P6)
     • 96-entry instruction window (Core 2), 128-entry
       window (Nehalem)
     • cf. in-order processors (Intel Atom, GPUs)
   Superscalar(Intel P5)
    • 3-wide(Core2) , 4-wide(Nehalem)

   Simultaneous Multi-threading(from Intel Pentium4)
    •   TLP(Thread-Level Parallelism)
    •   Hardware multi-threading
    •   Support of HW-level context switching
    •   issues multiple instructions from multiple threads in one cycle.
    •   HT(Hyper Threading) is Intel‘s term for SMT
   SIMD(Single Instruction Multiple Data)(Intel Pentium III)
    • DLP(Data-Level Parallelism)
    • 128bit SSE(Streaming SIMD Extensions) for x86 architecture




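   As a concrete illustration of the DLP style above, a minimal sketch using
   128-bit SSE intrinsics; the function and the assumption that n is a multiple
   of 4 are mine, not from the talk:

    /* add two float arrays four lanes at a time with 128-bit SSE */
    #include <xmmintrin.h>

    void add_arrays_sse(int n, const float *a, const float *b, float *out)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a   */
            __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b   */
            _mm_storeu_ps(&out[i], _mm_add_ps(va, vb)); /* 4 adds in one instr.   */
        }
    }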
   Branch prediction and speculative execution
    • guess which way a branch will go before this is known for
        sure
    •   To improve the flow in the ILP
   Hardware prefetching
    • Hiding latency by fetching data from memory in advance
    • Advantage
         • No need to add any instruction overhead to issue prefetches
         • No SW cost
    • Disadvantage
         • Cache pollution
         • Bandwidth can be wasted
         • H/W cost and compatibility


Speed of a Program
 CPI(Clock Per Instruction) vs.
  IPC(Instructions Per Clock)
 MIPS(Million Instructions Per Second)
   • FLOPS(Floating-point Operations Per Second)
      • GFLOPS, TFLOPS
 T = N x CPI x T_cycle
 Improvement
   • Reduce the # of instructions
   • Reduce CPI
   • Increase clock speed

Virtuous Cycle, circa 1950 – 2005
   Increased processor performance -> larger, more feature-full software ->
   larger development teams -> higher-level languages & abstractions ->
   slower programs -> (demand for) increased processor performance -> ...

 World-Wide Software Market (per IDC):
  $212b (2005) -> $310b (2010)
Virtuous Cycle, circa 2005-??


   The old cycle is broken: "increased processor performance" is crossed out,
   so slower programs and larger, more feature-full software are no longer
   bailed out by faster single cores, while higher-level languages &
   abstractions and larger development teams continue.

   GAME OVER - NEXT LEVEL?
   Thread-Level Parallelism & Multicore Chips
CMP(Chip Level Multiprocessor)
   Apple Inc. starts to sell 12-core Mac Pro(in Aug 2010)
     • "The new Mac Pro offers two advanced processor options from Intel.
        The Quad-Core Intel Xeon "Nehalem" processor is available in a
        single-processor, quad-core configuration at speeds up to 3.2GHz.
        For even greater speed and power, choose the "Westmere" series,
        Intel's next-generation processor based on its latest 32-nm process
        technology. "Westmere" is available in both quad-core and 6-core
        versions, and the Mac Pro comes with either one or two processors.
        Which means that you can have a 6-core Mac Pro at 3.33GHz, an 8-
        core system at 2.4GHz, or, to max out your performance, a 12-core
        system at up to 2.93GHz." from Apple homepage




Multicore
   Moore's law is still valid
     • "The # of transistors on an integrated circuit has doubled
       approximately every other year." - Gordon E. Moore, 1965
   Obstacles to increasing clock speed
     • Power density problem
        • "Can soon put more transistors on a chip than can afford to
          turn on" - Patterson '07
     • Heat problem
        • e.g., Intel Pentium IV Prescott (3.7GHz) in 2004
   Limits in Instruction-Level Parallelism(ILP)
 => The emergence of Multicore !!
   [Die photos: Intel Core2 Duo (two cores sharing an L2 cache) and Intel Core i7]
   Chip density is continuing to increase ~2x every 2 years
   Clock speed is not increasing
   Number of processor cores may double instead
   There is little or no hidden parallelism (ILP) to be found
   Parallelism must be exposed to and managed by software

                       Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

"Can soon put more transistors on a chip than can afford to turn on."
                                                   -- Patterson '07

   Scaling clock speed (business as usual) will not work
   [Chart: power density (W/cm²) vs. year, 1970-2010, for Intel chips from the
    4004, 8008, 8080, 8085, 8086, 286, 386 and 486 up to the Pentium and P6;
    extrapolating the trend passes hot-plate, nuclear-reactor, rocket-nozzle,
    and sun's-surface power densities. Source: Patrick Gelsinger, Intel]
Parallelism Saves Power
   Exploit explicit parallelism for reducing
    power
     Power = C * V² * F                         Performance = Cores * F
       -> 2C * (V/2)² * (F/2) = (C * V² * F)/4    -> 2Cores * (F/2) = (Cores * F) * 1
     (C: capacitance, V: voltage, F: frequency)
• Using additional cores
     - Increase density (= more transistors = more capacitance)
     - Can increase cores (2x) and performance (2x)
     - Or increase cores (2x), but decrease frequency (1/2): same performance
       at ¼ the power
• Additional benefits
     - Small/simple cores -> more predictable performance
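   A minimal sketch of the arithmetic above with normalized values (all
   constants are illustrative assumptions):

    /* doubling cores while halving frequency (and voltage) keeps performance
     * constant at roughly one quarter of the dynamic power */
    #include <stdio.h>

    int main(void) {
        double C = 1.0, V = 1.0, F = 1.0, cores = 1.0;   /* normalized baseline */

        double power_1 = C * V * V * F;                  /* P ~ C * V^2 * F     */
        double perf_1  = cores * F;

        double power_2 = (2*C) * (V/2)*(V/2) * (F/2);    /* 2 cores, half V & F */
        double perf_2  = (2*cores) * (F/2);

        printf("baseline: perf=%.2f power=%.2f\n", perf_1, power_1);
        printf("2 cores : perf=%.2f power=%.2f\n", perf_2, power_2); /* 1.00, 0.25 */
        return 0;
    }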
Amdahl‘s law
 Two basic metrics
  • Speedup(n) = T(1) / T(n)
  • Efficiency(n) = Speedup(n) / n
 Recall Amdahl's law [1967]
  • Speedup(f, n) = 1 / ((1 - f) + f/n), where f is the parallelizable fraction
  • Simple SW assumption
  • No overhead for
     • scheduling, communication, synchronization, etc.

    • e.g., f = 0.9 on n = 8 cores gives a speedup of only ~4.7



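   A minimal sketch of Amdahl's law in C, under the no-overhead assumption
   above (the parallel fraction f = 0.9 is an assumed example):

    /* speedup for parallel fraction f on n cores, no parallelization overhead */
    #include <stdio.h>

    static double amdahl(double f, double n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        double f = 0.9;                         /* assumed parallel fraction    */
        for (int n = 1; n <= 64; n *= 2)
            printf("n=%2d  speedup=%.2f\n", n, amdahl(f, n));
        return 0;   /* speedup saturates near 1/(1-f) = 10 */
    }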
Types of multicore
   Symmetric multicore
    • e.g., Core 2 Duo, i5, i7, octo-core Xeon
   Assume that
    •   Each chip is bounded to N BCEs (Base Core Equivalents) in total
    •   Each core consumes R BCEs
    •   Symmetric multicore = all cores identical
    •   Therefore, N/R cores per chip: (N/R)*R = N
    •   For an N = 16 BCE chip:




    Sixteen 1-BCE cores        Four 4-BCE cores                 One 16-BCE core
Performance of Symmetric
            Multicore Chips
   Serial Fraction 1-F uses 1 core at rate Perf(R)
   Serial time = (1 – F) / Perf(R)
   Parallel Fraction uses N/R cores at rate Perf(R) each
   Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
   Therefore, w.r.t. one base core:
      Symmetric Speedup = 1 / ( (1-F)/Perf(R) + F*R/(Perf(R)*N) )
   Implications?
                                Enhanced Cores speed Serial & Parallel
Symmetric Multicore Chip, N = 16
            BCEs
   [Chart: symmetric speedup vs. R (BCEs per core) for an N = 16 BCE chip, with
    curves for F = 0.5, 0.9, 0.975, 0.99, 0.999. R = 1 gives 16 cores, R = 2 gives
    8 cores, R = 4 gives 4 cores, R = 8 gives 2 cores, R = 16 gives 1 core.
    e.g., F = 1, R = 1: 16 cores, speedup = 16; F = 0.9, R = 2: 8 cores, speedup = 6.7]
 F matters: Amdahl‘s Law applies to multicore chips
 MANY Researchers should target parallelism F first
 As Moore‘s Law increases N, often need enhanced core designs
 Some arch. researchers target on single-core performance
   Asymmetric multicore
    • Cell Broadband Engine in PS3
        • 1 PPE(Power Processor Element) and 8 SPEs(Synergistic
          Processor Elements)
   Each Chip Bounded to N BCEs (for all cores)
   One R-BCE Core leaves N-R BCEs
   Use N-R BCEs for N-R Base Cores
   Therefore, 1 + N - R Cores per Chip
   For an N = 16 BCE Chip:




Symmetric: Four 4-BCE cores             Asymmetric: One 4-BCE core
                                          & Twelve 1-BCE base cores
Performance of Asymmetric
            Multicore Chips
   Serial Fraction 1-F same, so time = (1 – F) /
    Perf(R)
   Parallel Fraction F
    • One core at rate Perf(R)
    • N-R cores at rate 1
    • Parallel time = F / (Perf(R) + N - R)
   Therefore, w.r.t. one base core:
         Asymmetric Speedup = 1 / ( (1-F)/Perf(R) + F/(Perf(R) + N - R) )
Asymmetric Multicore Chip, N =
            256 BCEs
   [Chart: asymmetric speedup vs. R for an N = 256 BCE chip, with curves for
    F = 0.5, 0.9, 0.975, 0.99, 0.999. R = 1 gives 256 base cores; R = 4 gives
    1 + 252 cores; R = 16 gives 1 + 240 cores; R = 64 gives 1 + 192 cores;
    R = 256 gives a single 256-BCE core.]

Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
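   A hedged sketch answering the question above numerically; Perf(R) = sqrt(R)
   is an assumed performance model (the one used in Hill & Marty [3]), not
   something fixed by these slides:

    /* compare symmetric and asymmetric speedups for N = 256, F = 0.975 */
    #include <math.h>
    #include <stdio.h>

    static double perf(double r) { return sqrt(r); }   /* assumed Perf(R)       */

    static double sym_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }
    static double asym_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }

    int main(void) {
        double f = 0.975, n = 256;
        for (double r = 1; r <= n; r *= 4)
            printf("R=%3.0f  symmetric=%6.1f  asymmetric=%6.1f\n",
                   r, sym_speedup(f, n, r), asym_speedup(f, n, r));
        return 0;   /* the asymmetric design wins once R > 1 */
    }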
Other laws
 Gustafson's law
  • Speedup(N) = α + (1 - α) * N = N - α * (N - 1), with α = 1 - f (the serial fraction)
 Karp-Flatt metric
  • an efficient way to estimate the serial fraction from real code
  • e = (1/ψ - 1/p) / (1 - 1/p), where ψ is the measured speedup on p processors




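   A minimal sketch of the Karp-Flatt estimate; the measured speedup and
   processor count are assumed example values:

    /* estimate the serial fraction e from a measured speedup psi on p cores */
    #include <stdio.h>

    static double karp_flatt(double psi, double p) {
        return (1.0/psi - 1.0/p) / (1.0 - 1.0/p);
    }

    int main(void) {
        double p = 8.0, psi = 4.7;     /* e.g., measured speedup of 4.7 on 8 cores */
        printf("estimated serial fraction e = %.3f\n", karp_flatt(psi, p)); /* ~0.10 */
        return 0;
    }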
Multicore Makes The Memory Wall Problem Worse
   Assume that each core requires 2GB/s bandwidth to access memory
   What if 6 cores access the memory at a time? => 12GB/s >> FSB bandwidth
   A prefetching scheme that is appropriate for a uniprocessor may be entirely
    inappropriate for a multiprocessor [22].

   [Chart: total CPU cycles of a query on 1 core vs. 8 cores, broken down into
    Computation, Branch Misprediction, L2 hit, DTLB miss, and Memory components
    (y-axis up to ~5x10⁹ cycles).]

   http://spectrum.ieee.org/computing/hardware/multicore-is-bad-news-for-supercomputers

   Solution: sharing memory access
GPU
 DLP(Data-Level Parallelism)
 GPU has become a powerful computing
  engine behind scientific computing and data-
  intensive applications
 Many lightweight in-order cores
 Has its own caches and memories, separate from the CPU's
 GPGPU applications are data-intensive,
  handling long-running kernel execution(10-
  1,000s of ms) and large data units ( 1-100s of
  MB)


GPU Architecture




                                                            NVIDIA GTX 512
   16 Streaming Multiprocessors(SM), each of which
    consists of 32 Stream Processors(SPs), resulting in 512
    cores in total.
   All threads running on SPs share the same program
    called kernel
   An SM works as an independent SIMT processor.
Levels of Parallel Granularity and
         Memory sharing
   A thread block is a batch of
    threads that can cooperate
    with each other by:
    • Synchronizing their execution
       • For hazard-free shared
         memory accesses
    • Efficiently sharing data
      through a low latency shared-
      memory
   Two threads from two
    different blocks cannot
    cooperate


Four Execution Steps
 The DMA controller transfers data from
  host(CPU) memory to device(GPU) memory
 A host program instructs the GPU to launch
  the kernel
 The GPU executes threads in parallel
 The DMA controller transfers result from
  device memory to host memory

   Warp; a basic execution(or scheduling) unit
    of SM, a group of 32 threads sharing the
    same instruction pointer; all threads in a
    warp take the same code path.
Comparison with CPU
CPU:
 • Maximizes ILP to accelerate a small # of threads
 • Large caches and sophisticated control planes for advanced features
    • e.g., superscalar, OoO execution, branch prediction, and speculative loads
 • Latency hiding is limited by CPU resources
 • Limited memory bandwidth (32GB/s for X5550)

GPU:
 • Maximizes thread-level parallelism
 • Devotes most of its die area to a large array of ALUs
 • Memory stalls can be effectively hidden with a large enough number of threads
 • Large memory bandwidth (177.4GB/s for GTX480)

GPU Programming Considerations
   What to offload
    • Computation- and memory-intensive algorithms with
      high regularity are well suited for GPU acceleration
   How to parallelize
   Data structure usage
    • Simple data structures such as arrays are recommended
   Divergence in GPU code
    • SIMT demands minimal code-path divergence caused by
      data-dependent conditional branches within a warp
   Expensive host-device memory transfer cost

FPGA(Field Programmable Gate
            Array)
 Von Neumann
  architecture vs.
  Hardware architecture
 Integrated circuit
  designed to be configured
  by customer.
 configuration is specified
  using a HDL(Hardware
  Description Language)

Limitations of FPGA
   Area/speed tradeoff
    • Finite number of CLBs on a single die
    • Becomes slower and more power-hungry as logic
      becomes more complex
 Acts as hard-wired logic once it is configured ("cooked")
 No support for recursive calls

 Asynchronous design

 Less power efficient


Is DB execution cache-friendly?
   DB Execution Time Breakdown (in 2005)




 At least 50% cycles on stalls.
 Memory access is major bottleneck
 Branch mispredictions increase cache misses

Modern DB techniques
• Cache-conscious                          • CMP and multithreading
   • Cache-friendly data                          • Memory scan sharing
     placement                                    • Staged DB execution
   • Data cache                            • GPGPU
       • Cache-conscious data
         structure                                • SIMT
       • Buffering index structure         • FPGA
       • Hiding latency using                     • Von-Neumann vs. HW
         Prefetching
       • Cache-conscious join
                                                    circuit
   • Instruction cache
       • Buffering
       • Staged database
         execution
   • Branch prediction
       • Reduce branches and
         SIMD



Record Layout Schemes
   Example query: SELECT name FROM R WHERE age > 50
   PAX optimizes cache-to-memory communication but retains NSM's I/O
   (page contents do not change)




(a) NSM(N-ary Storage Model) (b) DSM(Decomposed Storage Model) or
    Column-based (c)PAX(Partition Attribute Across)

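A hedged sketch of the row-oriented (NSM) vs. column-oriented (DSM) layouts for
the example relation R(name, age); the type and array names are illustrative:

    #define N_TUPLES 1024

    /* NSM: whole tuples stored contiguously; scanning 'age' drags 'name'
     * through the cache as well. */
    struct tuple { char name[20]; int age; };
    struct tuple r_nsm[N_TUPLES];

    /* DSM / column layout: each attribute stored contiguously; a predicate on
     * 'age' touches only the age column, so many more values fit per cache line. */
    char r_name[N_TUPLES][20];
    int  r_age[N_TUPLES];

    int count_older_than(int threshold) {
        int cnt = 0;
        for (int i = 0; i < N_TUPLES; i++)
            if (r_age[i] > threshold)   /* sequential scan over a dense int array */
                cnt++;
        return cnt;
    }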
Main-Memory Tree Indexes
   T-tree: Balanced-binary tree proposed in 1986 for
    MMDB
    • Aim: balance space overhead with searching time.
   Main-memory B+-trees: better cache performance[4]
    • Node width = cache line size (32-128B)
    • Minimizes the number of cache misses per search
    • But the tree becomes much taller than a traditional disk-based B+-tree
      => more cache misses per lookup
    • How to make the B+-tree shallower?




Cache Sensitive B+-tree
   Layout child nodes contiguously
   Eliminate all but one child pointers
    • keys in one node fit in one cache line
    • Removing pointers increases the fanout of the tree, which
      results in a reduced tree height




    • 35% faster tree lookups
    • Update performance is 30% worse (splits)
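   A hedged sketch of a cache-sensitive B+-tree internal node along these lines
   (after [4]); the 64B node size and field widths are illustrative assumptions:

    /* keys fill one 64B cache line; a single pointer addresses a contiguous
     * group of children, so child i lives at children + i */
    struct csb_node {
        int   nkeys;                 /* number of valid keys                     */
        int   keys[13];              /* keys packed into the rest of the 64B line */
        struct csb_node *children;   /* one pointer to a contiguous child group  */
    };

    /* descend one level: search within the line, then jump to the i-th child
     * by offset instead of following a stored per-child pointer */
    static struct csb_node *descend(struct csb_node *n, int key) {
        int i = 0;
        while (i < n->nkeys && key >= n->keys[i]) i++;
        return n->children + i;
    }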
Buffering Index Structures
   buffering accesses to the index structure to avoid cache
    thrashing
   Nodes in the index tree are grouped together into
    pieces that fit within the cache
   Increase temporal locality but accesses can be delayed




Prefetching B+-tree



   Idea: Larger nodes + prefetching
   Node size = multiple cache lines (e.g., 8 lines)
   Prefetch all lines of a node before search it
   Cost to access a node only increases slightly
   Much shallower tree, no changes required
   Improves both search and update performance

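   A hedged sketch of the node-wide prefetch step; the node layout, line size,
   and use of GCC's __builtin_prefetch are assumptions for illustration:

    /* pB+-tree idea: nodes span several cache lines; prefetch every line of a
     * node before searching it so the misses overlap */
    #define CACHE_LINE 64
    #define NODE_LINES 8                      /* node size = 8 cache lines      */

    struct pbt_node { char bytes[NODE_LINES * CACHE_LINE]; };

    static void prefetch_node(const struct pbt_node *n) {
        for (int i = 0; i < NODE_LINES; i++)
            __builtin_prefetch((const char *)n + i * CACHE_LINE);
        /* ... then binary-search the keys inside the node as usual ... */
    }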
Fractal pB+-tree




   For faster range scan
    • Leaf parent nodes contain addresses of all leaves
    • Link leaf parent nodes together
    • Use this structure for prefetching leaf nodes
   * A prefetching scheme that is appropriate for a
    uniprocessor may be entirely inappropriate for a
    multiprocessor [22].

Cache-Conscious Hash Join
    For good temporal
    locality, two relations to
    be joined are partitioned
    into partitions that fit in
    the data cache.
   To reduce TLB misses caused by a large number of partitions H,
    use radix clustering (radix hash)
    • In the cluster, # of
        random accesses is low
    •   a large number of
        clusters can be created by
        making multiple passes
        through the data
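   A hedged sketch of one radix-partitioning pass along these lines; the tuple
   layout and fan-out are illustrative assumptions:

    /* scatter tuples into 2^RADIX_BITS partitions using a few key bits per pass,
     * keeping the fan-out small enough to stay TLB/cache resident */
    #include <stdlib.h>

    #define RADIX_BITS 6                        /* fan-out 64 per pass (assumed) */
    #define FANOUT     (1 << RADIX_BITS)

    typedef struct { unsigned key; unsigned payload; } tuple_t;

    void radix_partition(const tuple_t *in, tuple_t *out, size_t n, int shift) {
        size_t hist[FANOUT] = {0}, offset[FANOUT];

        for (size_t i = 0; i < n; i++)                      /* pass 1: histogram */
            hist[(in[i].key >> shift) & (FANOUT - 1)]++;

        size_t sum = 0;
        for (int p = 0; p < FANOUT; p++) {                  /* prefix sum        */
            offset[p] = sum;
            sum += hist[p];
        }
        for (size_t i = 0; i < n; i++) {                    /* pass 2: scatter   */
            unsigned p = (in[i].key >> shift) & (FANOUT - 1);
            out[offset[p]++] = in[i];
        }
        /* further passes with a different 'shift' refine the partitions until
         * each one fits in the data cache */
    }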
Group prefetching




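   The figure for this slide is not reproduced; as a rough, hedged sketch of the
   group-prefetching idea from [6], probe tuples are processed in small groups so
   that their bucket cache misses overlap (the table, hash function, and group
   size below are assumed, not from the talk):

    #define GROUP 8

    typedef struct bucket bucket_t;
    struct bucket { unsigned key; unsigned payload; bucket_t *next; };

    extern bucket_t *hash_table[];           /* assumed: built elsewhere            */
    extern unsigned  hashf(unsigned key);    /* assumed hash, maps into the table   */

    void probe_group(const unsigned *keys, int n) {
        for (int g = 0; g < n; g += GROUP) {
            bucket_t *b[GROUP];
            /* stage 1: compute bucket addresses and prefetch them for the group */
            for (int i = 0; i < GROUP && g + i < n; i++) {
                b[i] = hash_table[hashf(keys[g + i])];
                __builtin_prefetch(b[i]);
            }
            /* stage 2: visit the buckets; their cache misses overlap within the group */
            for (int i = 0; i < GROUP && g + i < n; i++) {
                for (bucket_t *cur = b[i]; cur; cur = cur->next)
                    if (cur->key == keys[g + i]) { /* emit join result */ }
            }
        }
    }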
a group




Buffering tuples btw. operators
   group consecutive operators into execution groups
    whose operators fit into the L1 I-cache.
   buffering output of the execution group
   I-Cache misses are amortized over multiple tuples
    and i-cache thrashing is avoided




How SMTs can help DB
           performance
 Bi-threaded: partition input, cooperative
  threads
 Work-ahead-set: main thread + helper thread
   • Main thread posts ―work-ahead set‖ to a queue
   • Helper thread issues load instructions for the
    requests




Staged Database Execution Model
   TX may be divided into stages that fit in the L1 I-cache
   When one tx reaches the end of stage, system switches
    context to a different thread that needs to execute the
     same stage.

                   [Figure: the transaction's instruction stream (LOAD X, STORE Y,
                    STORE Y, LOAD Y, ..., STORE Z, LOAD Z, ...) is split into stages
                    S0, S1, S2, each small enough to fit in the L1 I-cache.]
Stage Spawning

   [Figure: stages S0 (LOAD X, STORE Y, STORE Y), S1 (LOAD Y, ..., STORE Z) and
    S2 (LOAD Z, ...) are assigned to Core 0, Core 1 and Core 2; each core has a
    work-queue holding instances of its stage, so transactions migrate from
    queue to queue as they advance through the stages.]
Main-Memory Scan Sharing




  •Memory scan sharing also increases temporal locality
  •Too much sharing can cause cache thrashing




Summary
 Latency is a major problem
 Cache-friendly programming is indispensable
 Chip-level multiprocessors need to be exploited
  through TLP
 Facilitating diverse computing resources is a
  challenge


Further readings
1.    Jim Gray and Gianfranco R. Putzolu, The 5 Minute Rule for Trading Memory for Disk Accesses and
      The 10 Byte Rule for Trading Memory for CPU Time, SIGMOD 1987: 395-398
2.    David A. Patterson, Latency Lags Bandwidth, CACM, Vol. 47, No. 10, pp. 71-75, 2004
3.    Mark Hill et al., Amdahl's Law in the Multicore Era, IEEE Computer, Vol. 41, No. 7, pp. 33-38,
      2008
4.    J. Rao et al., Cache Conscious Indexing for Decision-Support in Main Memory
5.    P. A. Boncz et al., Breaking the Memory Wall in MonetDB, CACM, Dec 2008
6.    Shimin Chen et al., Improving Hash Join Performance through Prefetching, ICDE 2004
7.    Jingren Zhou et al., Implementing Database Operations Using SIMD Instructions, SIGMOD
      2002
8.    J. Cieslewicz and K.A. Ross, Database Optimizations for Modern Hardware, Proceedings of the
      IEEE 96(5), 2009
9.    Lawrence Spracklen et al., Chip Multithreading: Opportunities and Challenges
10.   Nikos Hardavellas et al., Database Servers on Chip Multiprocessors: Limitations and
      Opportunities, CIDR 2007
11.   Lin Qiao et al., Main-Memory Scan Sharing For Multi-Core CPUs, PVLDB 2008
12.   Ryan Johnson et al., To Share or Not to Share?, VLDB 2007
13.   Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 2009
14.   Database Architectures for New Hardware, Tutorial, in the 30th VLDB, 2004 and in the 21st
      ICDE 2005
15.   Query Co-processing on Commodity Processors, Tutorial in the 22nd ICDE 2006
16.   John Nickolls et al., GPU Computing Era, IEEE Micro, March/April 2010
17.   Kayvon Fatahalian et al., A Closer Look at GPUs, CACM, Vol. 51, No. 10, 2008



18.   John Nickolls et al., Scalable Parallel Programming, ACM Queue, March/April 2008
19.   N.K. Govindaraju et al., GPUTeraSort: High Performance Graphics Co-processor Sorting for Large
      Database Management, SIGMOD 2006
20.   A. Mitra et al., Boosting XML Filtering with a Scalable FPGA-based Architecture, CIDR 2009
21.   S. Harizopoulos, A. Ailamaki et al., Improving Instruction Cache Performance in OLTP, ACM
      TODS, Vol. 31, pp. 887-920
22.   T. Mowry and A. Gupta, Tolerating Latency through Software-Controlled Prefetching in Shared-Memory
      Multiprocessors, Journal of Parallel and Distributed Computing, 12(2):87-106, 1991

*Courses available on Internet
   Introduction to Computer Systems @CMU, 2000~2010
      • http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f10/www/index.html
   Multicore Programming Primer @MIT, 2007 (with video)
      • http://groups.csail.mit.edu/cag/ps3/index.shtml
   Introduction to Multiprocessor Synchronization @Brown
      • http://www.cs.brown.edu/courses/cs176
   Parallel Programming for Multicore @Berkeley, Spring 2007
      • http://www.cs.berkeley.edu/~yelick/cs194f07/
   Applications of Parallel Computing @Berkeley, Spring 2007
      • http://www.cs.berkeley.edu/~yelick/cs267_sp07/
   High-Performance Computing for Applications in Engineering @Wisc, Autumn 2008
      • http://sbel.wisc.edu/Courses/ME964/2008/index.htm
   High Performance Computing Training @Lawrence Livermore National Laboratory
      • https://computing.llnl.gov/?set=training&page=index
   Programming Massively Parallel Processors with CUDA @Stanford, Spring 2010 (with video)
      • on Itunes U and Youtube.com



SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...Kyong-Ha Lee
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
좋은 논문 찾기
좋은 논문 찾기좋은 논문 찾기
좋은 논문 찾기Kyong-Ha Lee
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXMLKyong-Ha Lee
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingKyong-Ha Lee
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingKyong-Ha Lee
 

More from Kyong-Ha Lee (7)

SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
좋은 논문 찾기
좋은 논문 찾기좋은 논문 찾기
좋은 논문 찾기
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXML
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 
Bitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query ProcessingBitmap Indexes for Relational XML Twig Query Processing
Bitmap Indexes for Relational XML Twig Query Processing
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Database Research on Modern Computing Architecture

  • 1. Database Research on Modern Computing Architecture September 10, 2010 Kyong-Ha Lee (bart7449@gmail.com) Department of Computer Science KAIST, Daejeon, Korea
  • 2. Brief Overview of This Talk  Basic theories and principles about database technology on modern HW • Not much discussion on implementation or tools, but will be happy to discuss them if there are any questions  Topics • The immense changes in computer architecture • A variety of computing sources • Intra-node parallelism • The DB technology that facilitates modern HW features Invited talk @ ETRI, © Kyong-Ha Lee 2
  • 3. Things we have now in our PC Core 1 16 Integer Throughput ~1 instruction per cycle. registers One cycle takes ~0.33 ns. Core 2 16 Double The exact # of cycles depends on the in FP registers struction L1 D-cache L1 TLB L1 D-cache 32KB L1 I-cache 128 entries 32KB L1 I-cache L1 Latency 1ns 32KB for I 256 en Latency 1ns 32KB TLB (3 cycles) tries for D (3 cycles) L2 Unified Cache Intel Core 2 Duo L2 TLB 6MB 3.0GHz, E8400 Latency 4.7ns (14 cycles) Wolfdale Front Side Bus 1,333MHz Bandwidth: 10GB/S DDR3 Ram Modules Intel X48 PCI Express 2.0 x16, 8GB/s (e 4 GB ach way) Northbridge Latency: ~83ns(~250 c Chip Invited talk @ ETRI, © Kyong-Ha Lee ycles) 3
  • 4. DMI Interface Bandwidth: 1GB/s (each way) USB 2.0 ~30MB/s Serial ATA port 300MB/s FireWire 800 ~55MB/s PCIe 2.0 x1, 500MB/s (each way) Seagate 1TB 7,200 RPM Wireless 802.11g ~2.5 MB/s 32MB HDD Cache Gigabit wired ethernet. ~100 MB/s Operates at SATA rate Intel ICH9R Southbridge My LGU+ cable line chip Sustained disk I/O ~138Mb/S 100Mb/s up/down Random seek time: 8.5ms (read)/9.5ms(write) 25.7 million/28.8 million Internet cycles original source: http://duartes.org/gustavo/blog/post/what-your- computer-does-while-you-wait Invited talk @ ETRI, © Kyong-Ha Lee 4
  • 5. So what's happening now?  Changes in memory hierarchy • Higher capacity and the emergence of Non-Volatile RAM (NVRAM)  Memory wall and multi-level caches • Latency lags bandwidth  Increasing number of cores in a single die • Multicore or CMP  A variety of computing sources • CPU, GPU, NP and FPGA  Intra-/Inter-node parallelism • CMP vs. MPP Invited talk @ ETRI, © Kyong-Ha Lee 5
  • 6. Now In Memory Hierarchy  Very cheap HDD with VERY high capacity • Seagate 1TB Barracuda 3.5" HDD (7200rpm, 32MB) for 74,320 won ($61.94) in Aug 2010 • 1GB for 74.3 won  Write-once storage • Tape drives are dead, ODD is waning • Due to the poor latency and seek time • Seek time >= 100ms • although a 22X DVD writer can sequentially write 4.7GBytes within 3 minutes (29.7MB/s in theory)  1GB for 53.82 won  Price of RAM has fallen enough to keep much more data in memory than before. • A 4GB DDR3 memory module (1,333MHz) for 108,000 won • 1GB for 27,000 won ($22.5) • but still cost_m >> cost_d Invited talk @ ETRI, © Kyong-Ha Lee 6
  • 7. The Five-Minute Rule  Cache randomly accessed disk pages that are reused every 5 minutes [1]. • BreakEvenIntervalInSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) x (PricePerDiskDrive / PricePerMBofRAM)  In 1987, the break-even interval was ~2 minutes  After that, ~5 minutes in 1997 and ~88 minutes in 2007.  "Memory becomes HDD, HDD becomes Tape, and Tape is dead", by Jim Gray  Today's memory is ~102,400 times faster than HDD • Memory: 83 ns (250 cycles) • HDD: 8.5ms (25.7 million cycles)  With 2010 prices: (256/116) x (61.94/0.0225) = ~6,076 seconds = ~101 minutes.  => Cache your data in memory as often as possible. Invited talk @ ETRI, © Kyong-Ha Lee 7
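A quick way to sanity-check the ~101-minute figure is to plug the slide's 2010 numbers into the break-even formula directly. The sketch below does that in C; the prices and the access rate are the values quoted on this slide, taken here as illustrative inputs rather than authoritative constants.

    #include <stdio.h>

    int main(void) {
        /* Five-minute-rule inputs, taken from the numbers on this slide. */
        double pages_per_mb_ram = 256.0;    /* 4KB pages per MB of RAM             */
        double accesses_per_sec = 116.0;    /* random accesses per second per disk */
        double price_per_disk   = 61.94;    /* USD, Seagate 1TB in Aug 2010        */
        double price_per_mb_ram = 0.0225;   /* USD per MB of DDR3                  */

        double interval_sec = (pages_per_mb_ram / accesses_per_sec)
                            * (price_per_disk / price_per_mb_ram);
        printf("break-even interval: %.0f seconds (~%.0f minutes)\n",
               interval_sec, interval_sec / 60.0);
        return 0;
    }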
  • 8. Latency lags bandwidth  From 1983 to 2003 [2] • Capacity increased ~2,500 times (0.03GB -> 73.4GB) • Bandwidth improved 143.3 times (0.6 MB/s -> 86 MB/s) • Latency improved 8.5 times (48.3 -> 5.7 ms)  Why? • Moore's law helps bandwidth more than latency • Distance limits latency • Bandwidth is generally easier to sell • Latency helps bandwidth but not vice versa (e.g., spinning the disk faster) • Bandwidth hurts latency (e.g., buffering) • OS overhead hurts latency Invited talk @ ETRI, © Kyong-Ha Lee 8
  • 9. Latency vs. Bandwidth  Latency can be handled by • Hiding (or tolerating) it – out-of-order issue, non-blocking caches, prefetching • Reducing it – better caches  Parallelism sometimes helps to hide latency • MLP (Memory-Level Parallelism) - multiple outstanding cache misses overlapped • But increased bandwidth demand  Latency is ultimately limited by physics  Bandwidth can be handled by "spending" more (HW cost) • Wider buses and interfaces, interleaving  Bandwidth improvement usually increases latency • No free lunch  Hierarchies decrease bandwidth demand to lower levels. • They serve as traffic filters: a hit in L1 is filtered from L2 • If average bandwidth demand is not met -> infinite queues Invited talk @ ETRI, © Kyong-Ha Lee 9
  • 10. NVRAM Storage: Solid State Disk  Intel X25-M Mainstream (50nm) 160GB • Read/write latency 85/115 us • Random 4KB read/write: 35K/3.3K IOPS • Sustained sequential read/write: 250/70MB/s • 1GB for 3,619 won in Aug 2010  SSD has successfully occupied the position between memory and HDD • best suited for sequential read/write • e.g., as a logging device Invited talk @ ETRI, © Kyong-Ha Lee 10
  • 11. Features of SSD  No mechanical latency • Flash memory is an electronic device with no moving parts • Provides uniform random access speed without seek/rotational latency  No in-place update • No data on a page can be updated in place before erasing it first • An erase unit (or block) is much larger than a page  Limited lifetime • MLC: 0.1M writes, SLC: 1M writes • Wear-leveling  Asymmetric read & write speed • Read speed is typically at least 3X faster than write speed • Write (and erase) optimization is critical  Asymmetric sequential vs. random I/O performance • Random 4KB read/write: 35K/3.3K IOPS • i.e., 140MB/s / 13.2MB/s in aggregate • Sustained sequential read/write: 250/70MB/s  "Disk" abstraction • LBA (or LPA) -> (channel#, plane#, … ) or just PBA (or PPA) • This mapping changes each time a page write is performed • The controller must maintain a mapping table in RAM or Flash Invited talk @ ETRI, © Kyong-Ha Lee 11
  • 12. Memory wall  Latencies • The CPU stalls because of time spent on memory access • latency of a memory access: 250 cycles • ~249 instruction slots are wasted waiting for data from memory.  Solution: CPU caching!! Invited talk @ ETRI, © Kyong-Ha Lee 12
  • 13. Why Caching?  Processor speeds are projected to increase about 70% per year for many years to come. This trend will widen the speed gap between memory and processors. Caches will get larger, but memory speed will not keep pace with processor speeds  Low-latency memory that hides memory access latency • Static RAM vs. Dynamic RAM • 3ns (L1) ~ 14ns (L2) vs. 83ns  Small capacity with support for locality • Temporal locality • Recently referenced items are likely to be referenced in the near future • Capacity limits the # of items to be kept in the cache at a time • The L1$ in Intel Core i7 is 32KB • Spatial locality • Items with nearby addresses tend to be referenced close together in time • the size of one cache line • e.g., the cache line size in Intel Core i7 is 64B • So 32K/64B = 512 cache lines Invited talk @ ETRI, © Kyong-Ha Lee 13
  • 14. An Example Memory Hierarchy Source: Computer Systems, A Programmer‘s Perspective, 2003 Invited talk @ ETRI, © Kyong-Ha Lee 14
  • 15. Memory Mountain in 2000 32B cache line size Source: Computer Systems, A Programmer‘s Perspective, 2003 Invited talk @ ETRI, © Kyong-Ha Lee 15
  • 16. Memory Mountain in 2010 Intel Core i7 2.67GHz 32KB L1 d-cache 256KB L2 cache 8MB L3 cache 64B cache line size source: http://csapp.cs.cmu.edu/public/perspective.html Invited talk @ ETRI, © Kyong-Ha Lee 16
  • 17. CPU Cache Structure Source: Computer Systems, A Programmer‘s Perspective, 2003 Invited talk @ ETRI, © Kyong-Ha Lee 17
  • 18. Addressing Caches Source: Computer Systems, A Programmer‘s Perspective, 2003 Invited talk @ ETRI, © Kyong-Ha Lee 18
  • 19. Types of Cache Misses  Cold miss (or compulsory miss) • The data has never been loaded before its first reference.  Capacity miss • Because of limited capacity • must evict a victim to make space for the replacement block • LRU or LFU  Conflict miss • involves cache thrashing • can be alleviated by an associative cache • e.g., the 8-way set-associative cache in Core 2 Duo  Coherence miss • Caused by keeping data consistent between caches Invited talk @ ETRI, © Kyong-Ha Lee 19
  • 20. Cache Performance  Metrics • Miss rate: # of misses / # of references • The fraction of memory references that miss during the execution of a program. • Hit rate: # of hits / # of references • Hit time: the time to deliver a word from the cache to the CPU • Miss penalty: any additional time required because of a miss.  Impact of: • Cache size: reduces capacity misses and increases both hit rate and hit time • Cache line size: increases spatial locality and decreases temporal locality • Associativity • Fully associative: no conflict misses, but eventually a linear scan of cache lines • Direct-mapped: conflict misses Invited talk @ ETRI, © Kyong-Ha Lee 20
  • 21. Writing Cache-Friendly Codes  Maximize the two kinds of locality in your program • Remove as many pointers as possible • Increases spatial locality, at the cost of more expensive updates • Fit the working data into a cache line and into the capacity of the cache • Increases spatial and temporal locality • Use working data as often as possible once it has been read from memory (see the traversal sketch below)  Software prefetching • Reduces cold miss rates Invited talk @ ETRI, © Kyong-Ha Lee 21
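As an illustration of the spatial-locality advice above, the two loops below sum the same row-major matrix and differ only in traversal order. This is a minimal sketch assuming a plain C array: the row-wise version touches consecutive addresses and reuses each cache line fully, while the column-wise version strides by a whole row and tends to miss on every access once the matrix exceeds the cache.

    #include <stdio.h>

    #define N 1024
    static double m[N][N];            /* row-major, as in C */

    double sum_row_major(void) {      /* cache-friendly: walks consecutive addresses */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    double sum_col_major(void) {      /* cache-hostile: stride of N*8 bytes per access */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }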
  • 22. Example: Matrix Multiplication *Assumptions: • Row-major order • Cache block = 8 doubles • Cache size C << n • Blocked version: three B x B blocks fit into the cache, i.e., 3B^2 < C (see the blocked loop sketch below) Invited talk @ ETRI, © Kyong-Ha Lee 22
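The condition 3B^2 < C (three B-by-B blocks resident in the cache at once) is exactly what the blocked loop nest below exploits. This is a sketch under the slide's assumptions of row-major storage and a block size chosen to fit the cache, not a tuned kernel; the matrix size is assumed to be a multiple of the block size for brevity.

    #include <stddef.h>

    #define B 32   /* block size: pick B so that 3*B*B doubles fit in the cache */

    /* C = C + A * X, all n x n, row-major; n is assumed to be a multiple of B. */
    void matmul_blocked(size_t n, const double *A, const double *X, double *C) {
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t kk = 0; kk < n; kk += B)
                for (size_t jj = 0; jj < n; jj += B)
                    /* multiply one B x B block; all three blocks stay cache-resident */
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t k = kk; k < kk + B; k++) {
                            double a = A[i*n + k];
                            for (size_t j = jj; j < jj + B; j++)
                                C[i*n + j] += a * X[k*n + j];
                        }
    }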
  • 23. SW Prefetching  Loop unrolling with prefetching (inner product of double a[] and b[] on a 32-bit machine with a 32B cache line):
      for (int i = 0; i < N - 4; i += 4) {
          prefetch(&a[i+4]);   // fetch the next cache line of a[] before it is used
          prefetch(&b[i+4]);   // fetch the next cache line of b[] before it is used
          ip = ip + a[i]   * b[i];
          ip = ip + a[i+1] * b[i+1];
          ip = ip + a[i+2] * b[i+2];
          ip = ip + a[i+3] * b[i+3];
      }
  Data linearization • e.g., store tree nodes in preorder-traversal order so that a traversal reads memory sequentially (the slide shows a 7-node tree laid out as 1 2 3 4 5 6 7) Invited talk @ ETRI, © Kyong-Ha Lee 23
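The prefetch() call on the slide is an abstract placeholder. On GCC and Clang the same idea can be written with the __builtin_prefetch intrinsic, as in the sketch below; the prefetch distance of 16 elements is an assumption to be tuned per machine, not a fixed rule.

    /* Inner product with explicit software prefetching (GCC/Clang). */
    double dot_prefetch(const double *a, const double *b, int n) {
        double ip = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n) {                      /* prefetch ~16 elements ahead */
                __builtin_prefetch(&a[i + 16], 0, 1);
                __builtin_prefetch(&b[i + 16], 0, 1);
            }
            ip += a[i] * b[i];
        }
        return ip;
    }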
  • 24. Optimizations in Modern Microprocessors  Pipelining (Intel i486) • utilizes ILP (Instruction-Level Parallelism) • increases throughput but not latency.  Out-of-order execution (Intel P6) • 96-entry instruction window (Core 2), 128-entry instruction window (Nehalem) • vs. in-order processors (Intel Atom, GPU)  Superscalar (Intel P5) • 3-wide (Core 2), 4-wide (Nehalem) Invited talk @ ETRI, © Kyong-Ha Lee 24
  • 25. Simultaneous Multi-threading (from Intel Pentium 4) • TLP (Thread-Level Parallelism) • Hardware multi-threading • Support for HW-level context switching • issues multiple instructions from multiple threads in one cycle. • HT (Hyper-Threading) is Intel's term for SMT  SIMD (Single Instruction, Multiple Data) (Intel Pentium III) • DLP (Data-Level Parallelism) • 128-bit SSE (Streaming SIMD Extensions) for the x86 architecture (see the intrinsics sketch below) Invited talk @ ETRI, © Kyong-Ha Lee 25
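To make the DLP point concrete, here is a minimal SSE2 sketch of the same inner product used on the previous slide, processing two doubles per instruction. It uses unaligned loads so no alignment guarantees are needed, and handles an odd trailing element with a scalar loop; it is an illustrative sketch, not the way any particular database engine writes its SIMD kernels.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    double dot_sse2(const double *a, const double *b, int n) {
        __m128d acc = _mm_setzero_pd();
        int i;
        for (i = 0; i + 2 <= n; i += 2) {              /* two doubles per iteration */
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            acc = _mm_add_pd(acc, _mm_mul_pd(va, vb)); /* SIMD multiply and accumulate */
        }
        double tmp[2];
        _mm_storeu_pd(tmp, acc);
        double ip = tmp[0] + tmp[1];
        for (; i < n; i++)                             /* scalar tail */
            ip += a[i] * b[i];
        return ip;
    }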
  • 26. Branch prediction and speculative execution • guess which way a branch will go before this is known for sure • To improve the flow in the ILP  Hardware prefetching • Hiding latency by fetching data from memory in advance • Advantage • No need to add any instruction overhead to issue prefetches • No SW cost • Disadvantage • Cache pollution • Bandwidth can be wasted • H/W cost and compatibility Invited talk @ ETRI, © Kyong-Ha Lee 26
  • 27. Speed of a Program  CPI (Clocks Per Instruction) vs. IPC (Instructions Per Clock)  MIPS (Million Instructions Per Second) • FLOPS (Floating-point Operations Per Second) • GFLOPS, TFLOPS  T = N x CPI x T_cycle (e.g., 10^9 instructions x CPI of 1.5 x 0.33 ns per cycle = ~0.5 s)  Improvement • Reduce the # of instructions • Reduce CPI • Increase clock speed Invited talk @ ETRI, © Kyong-Ha Lee 27
  • 28. Virtuous Cycle, circa 1950 – 2005  (cycle diagram: increased processor performance -> larger, more feature-full software -> slower programs -> higher-level languages & abstractions, larger development teams -> demand for still faster processors)  World-Wide Software Market (per IDC): $212b (2005) -> $310b (2010) Invited talk @ ETRI, © Kyong-Ha Lee 28
  • 29. Virtuous Cycle, circa 2005-??  (the same cycle diagram, but single-thread processor performance no longer increases: GAME OVER — NEXT LEVEL? Thread-level parallelism & multicore chips) Invited talk @ ETRI, © Kyong-Ha Lee 29
  • 30. CMP (Chip-Level Multiprocessor)  Apple Inc. starts to sell a 12-core Mac Pro (in Aug 2010) • "The new Mac Pro offers two advanced processor options from Intel. The Quad-Core Intel Xeon "Nehalem" processor is available in a single-processor, quad-core configuration at speeds up to 3.2GHz. For even greater speed and power, choose the "Westmere" series, Intel's next-generation processor based on its latest 32-nm process technology. "Westmere" is available in both quad-core and 6-core versions, and the Mac Pro comes with either one or two processors. Which means that you can have a 6-core Mac Pro at 3.33GHz, an 8-core system at 2.4GHz, or, to max out your performance, a 12-core system at up to 2.93GHz." from the Apple homepage Invited talk @ ETRI, © Kyong-Ha Lee 30
  • 31. Multicore  Moore's law is still valid • "The # of transistors on an integrated circuit has doubled approximately every other year." - Gordon E. Moore, 1965  Obstacles to increasing clock speed • Power density problem • "Can soon put more transistors on a chip than can afford to turn on" – Patterson '07 • Heat problem • e.g., Intel Pentium IV Prescott (3.7GHz) in 2004  Limits in Instruction-Level Parallelism (ILP) => The emergence of multicore!!  (die photos: Intel Core 2 Duo with two cores and a shared L2 cache; Intel Core i7) Invited talk @ ETRI, © Kyong-Ha Lee 31
  • 32. Chip density is continuing to increase ~2x every 2 years  Clock speed is not increasing  The number of processor cores may double instead  There is little or no hidden parallelism (ILP) left to be found  Parallelism must be exposed to and managed by software  Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond) Invited talk @ ETRI, © Kyong-Ha Lee 32
  • 33. Can soon put more transistors on a chip than can afford to turn on. -- Patterson '07  Scaling clock speed (business as usual) will not work  (chart: power density in W/cm2 vs. year, 1970-2010, for processors from the 4004 to the Pentium, with reference lines for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface; source: Patrick Gelsinger, Intel) Invited talk @ ETRI, © Kyong-Ha Lee 33
  • 34. Parallelism Saves Power  Exploit explicit parallelism to reduce power  Power = C * V^2 * F (Capacitance x Voltage^2 x Frequency), Performance = Cores * F  Doubling the cores and halving the frequency (which also allows roughly halving the voltage): Power = (2C) * (V/2)^2 * (F/2) = (C * V^2 * F)/4, Performance = (2 Cores) * (F/2) = Cores * F • Using additional cores – Increase density (= more transistors = more capacitance) – Can increase cores (2x) and performance (2x) – Or increase cores (2x) but decrease frequency (1/2): same performance at 1/4 the power • Additional benefits – Small/simple cores  more predictable performance Invited talk @ ETRI, © Kyong-Ha Lee 34
  • 35. Amdahl's law  Two basic metrics • typically speedup (T_1 / T_N) and efficiency (speedup / N)  Recall Amdahl's law [1967]: with parallel fraction F on N processors, Speedup = 1 / ((1 - F) + F/N) • Simple SW assumption • No overhead for • scheduling, communication, synchronization, etc. • e.g., F = 0.9 on N = 16 gives a speedup of only ~6.4 (see the sketch below) Invited talk @ ETRI, © Kyong-Ha Lee 35
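The reconstructed formula translates directly into code. This is a minimal sketch under the slide's own no-overhead assumption: f is the parallel fraction and n the number of cores.

    #include <stdio.h>

    /* Amdahl's law: speedup with parallel fraction f on n cores, no overhead. */
    double amdahl_speedup(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        printf("f=0.9,  n=16: %.2f\n", amdahl_speedup(0.9, 16));   /* ~6.4  */
        printf("f=0.99, n=16: %.2f\n", amdahl_speedup(0.99, 16));  /* ~13.9 */
        return 0;
    }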
  • 36. Types of multicore  Symmetric multicore • e.g., Core 2Duo, i5, i7, Xeon octo-core  Assume that • Each Chip Bounded to N BCEs (for all cores) • Each Core consumes R BCEs • Assume Symmetric Multicore = All Cores Identical • Therefore, N/R Cores per Chip — (N/R)*R = N • For an N = 16 BCE Chip: Sixteen 1-BCE cores Four 4-BCE cores One 16-BCE core Invited talk @ ETRI, © Kyong-Ha Lee 36
  • 37. Performance of Symmetric Multicore Chips  Serial fraction 1-F uses 1 core at rate Perf(R)  Serial time = (1 - F) / Perf(R)  Parallel fraction F uses N/R cores at rate Perf(R) each  Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)  Therefore, w.r.t. one base core: Symmetric Speedup = 1 / ( (1-F)/Perf(R) + F*R/(Perf(R)*N) )  Implications? Enhanced cores speed up both the serial and the parallel fraction Invited talk @ ETRI, © Kyong-Ha Lee 37
  • 38. Symmetric Multicore Chip, N = 16 BCEs  (chart: symmetric speedup vs. R BCEs per core, for F = 0.5 ... 0.999; e.g., F=1, R=1, 16 cores gives speedup 16, while F=0.9, R=2, 8 cores gives speedup 6.7)  F matters: Amdahl's Law applies to multicore chips; researchers should target parallelism F first  As Moore's Law increases N, enhanced core designs are often needed; some architecture researchers target single-core performance Invited talk @ ETRI, © Kyong-Ha Lee 38
  • 39. Asymmetric multicore • Cell Broadband Engine in PS3 • 1 PPE (Power Processor Element) and 8 SPEs (Synergistic Processor Elements)  Each chip bounded to N BCEs (for all cores)  One R-BCE core leaves N-R BCEs  Use N-R BCEs for N-R base cores  Therefore, 1 + N - R cores per chip  For an N = 16 BCE chip: Symmetric: four 4-BCE cores; Asymmetric: one 4-BCE core & twelve 1-BCE base cores Invited talk @ ETRI, © Kyong-Ha Lee 39
  • 40. Performance of Asymmetric Multicore Chips  Serial fraction 1-F is the same, so serial time = (1 - F) / Perf(R)  Parallel fraction F • One core at rate Perf(R) • N-R cores at rate 1 • Parallel time = F / (Perf(R) + N - R)  Therefore, w.r.t. one base core: Asymmetric Speedup = 1 / ( (1-F)/Perf(R) + F/(Perf(R) + N - R) ) Invited talk @ ETRI, © Kyong-Ha Lee 40
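Both speedup formulas (slides 37 and 40) can be evaluated with a few lines of C. The perf(R) = sqrt(R) core-performance model used here is an assumption commonly made in Hill and Marty's analysis [3], not something the formulas themselves dictate; compile with -lm for the math library.

    #include <math.h>
    #include <stdio.h>

    static double perf(double r) { return sqrt(r); }   /* assumed core-performance model */

    /* Symmetric chip: N/R identical cores, each built from R BCEs. */
    double speedup_sym(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }

    /* Asymmetric chip: one R-BCE core plus N-R base cores. */
    double speedup_asym(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }

    int main(void) {
        printf("sym : F=0.9,   N=16,  R=2 : %.2f\n", speedup_sym(0.9, 16, 2));
        printf("asym: F=0.975, N=256, R=16: %.2f\n", speedup_asym(0.975, 256, 16));
        return 0;
    }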
  • 41. Asymmetric Multicore Chip, N = 256 BCEs  (chart: asymmetric speedup vs. R BCEs for the enhanced core, for F = 0.5 ... 0.999)  Number of cores = 1 (enhanced) + 256 - R (base)  How do asymmetric & symmetric speedups compare? Invited talk @ ETRI, © Kyong-Ha Lee 41
  • 42. Other laws  Gustafson's law • Scaled speedup = N - α(N - 1), with serial fraction α = 1 - f  Karp-Flatt metric • an efficient way to estimate the serial fraction from real code • e = (1/ψ - 1/N) / (1 - 1/N), where ψ is the measured speedup on N processors (see the sketch below) Invited talk @ ETRI, © Kyong-Ha Lee 42
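For completeness, both laws fit in a few lines. The sketch below uses the scaled-speedup form of Gustafson's law and the Karp-Flatt estimate of the serial fraction from a measured speedup; it makes no claims beyond these textbook formulas.

    /* Gustafson's law: scaled speedup with serial fraction alpha = 1 - f on n cores. */
    double gustafson_speedup(double alpha, int n) {
        return n - alpha * (n - 1);
    }

    /* Karp-Flatt metric: experimentally determined serial fraction,
       given a measured speedup psi on n cores. */
    double karp_flatt(double psi, int n) {
        return (1.0 / psi - 1.0 / n) / (1.0 - 1.0 / n);
    }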
  • 43. Multicore Makes The Memory Wall Problem Worse  Assume that each core requires 2GB/s of bandwidth to access memory  What if 6 cores access memory at the same time? => 12GB/s >> FSB bandwidth  A prefetching scheme that is appropriate for a uniprocessor may be entirely inappropriate for a multiprocessor [22].  (chart: total CPU cycles of a query on 1 core vs. 8 cores, broken down into memory, DTLB miss, L2 hit, branch misprediction, and computation; see http://spectrum.ieee.org/computing/hardware/multicore-is-bad-news-for-supercomputers)  Solution: sharing memory access Invited talk @ ETRI, © Kyong-Ha Lee 43
  • 44. GPU  DLP(Data-Level Parallelism)  GPU has become a powerful computing engine behind scientific computing and data- intensive applications  Many light-weighted in-order cores  has separated caches and memories  GPGPU applications are data-intensive, handling long-running kernel execution(10- 1,000s of ms) and large data units ( 1-100s of MB) Invited talk @ ETRI, © Kyong-Ha Lee 44
  • 45. GPU Architecture NVIDIA GTX 512  16 Streaming Multiprocessors(SM), each of which consists of 32 Stream Processors(SPs), resulting in 512 cores in total.  All threads running on SPs share the same program called kernel  An SM works as an independent SIMT processor. Invited talk @ ETRI, © Kyong-Ha Lee 45
  • 46. Levels of Parallel Granularity and Memory sharing  A thread block is a batch of threads that can cooperate with each other by: • Synchronizing their execution • For hazard-free shared memory accesses • Efficiently sharing data through a low latency shared- memory  Two threads from two different blocks cannot cooperate Invited talk @ ETRI, © Kyong-Ha Lee 46
  • 47. Four Execution Steps  The DMA controller transfers data from host(CPU) memory to device(GPU) memory  A host program instructs the GPU to launch the kernel  The GPU executes threads in parallel  The DMA controller transfers result from device memory to host memory  Warp; a basic execution(or scheduling) unit of SM, a group of 32 threads sharing the same instruction pointer; all threads in a warp take the same code path. Invited talk @ ETRI, © Kyong-Ha Lee 47
  • 48. Comparison with CPU • It maximizes ILP to • It maximizes thread-level accelerate a small # of parallelism threads • It devotes most of their die • large caches and area to a large array of sophisticated control planes ALUs. for advanced features • e.g., superscalar, OoO • Memory stall can be execution, branch prediction, effectively minimized with and speculative loads an enough number of • Latency hiding is limited by threads CPU resources • Large memory • Limited memory bandwidth(177.4GB/s for bandwidth(32GB/s for GTX480) X5550) CPU GPU Invited talk @ ETRI, © Kyong-Ha Lee 48
  • 49. GPU Programming Considerations  What to offload • Computation and memory-intensive algorithms with high regularity suit well for GPU acceleration  How to parallelize  Data structure usage • Simple data structure such as arrays are recommended.  Divergency in GPU code • SIMT demands to have minimal code-path divergence caused by data-dependent conditional branches within a warp  Expensive host-device memory cost Invited talk @ ETRI, © Kyong-Ha Lee 49
  • 50. FPGA(Field Programmable Gate Array)  Von Neumann architecture vs. Hardware architecture  Integrated circuit designed to be configured by customer.  configuration is specified using a HDL(Hardware Description Language) Invited talk @ ETRI, © Kyong-Ha Lee 50
  • 51. Limitations of FPGA  Area/speed tradeoff • Finite number of CLBs on a single die • Becomes slower and more power-hungry as the logic becomes more complex  Acts as hard-wired logic once it is configured  No support for recursive calls  Asynchronous design  Less power-efficient Invited talk @ ETRI, © Kyong-Ha Lee 51
  • 52. Are DB execution cache-friendly?  DB Execution Time Breakdown (in 2005)  At least 50% cycles on stalls.  Memory access is major bottleneck  Branch mispredictions increase cache misses Invited talk @ ETRI, © Kyong-Ha Lee 52
  • 53. Modern DB techniques • Cache-conscious • CMP and multithreading • Cache-friendly data • Memory scan sharing placement • Staged DB execution • Data cache • GPGPU • Cache-conscious data structure • SIMT • Buffering index structure • FPGA • Hiding latency using • Von-Neumann vs. HW Prefetching • Cache-conscious join circuit • Instruction cache • Buffering • Staged database execution • Branch prediction • Reduce branches and SIMD Invited talk @ ETRI, © Kyong-Ha Lee 53
  • 54. Record Layout Schemes  Example query: SELECT name FROM R WHERE age > 50  PAX optimizes cache-to-memory communication but retains NSM's I/O behavior (page contents do not change)  (a) NSM (N-ary Storage Model) (b) DSM (Decomposed Storage Model), or column-based (c) PAX (Partition Attributes Across) Invited talk @ ETRI, © Kyong-Ha Lee 54
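The layout choice matters precisely for predicates like the example query above: a row layout (NSM) drags every attribute of every tuple through the cache, while a column layout (DSM) scans only the age column. The sketch below contrasts the two in C with made-up field names and sizes; it is an illustration of the data-placement idea, not the page format of any particular system.

    #include <stdint.h>
    #include <stddef.h>

    /* NSM / row store: one struct per tuple, attributes interleaved in memory. */
    struct person_row { char name[32]; int32_t age; int32_t zip; };

    size_t count_nsm(const struct person_row *r, size_t n) {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++)        /* every cache line carries name+zip too */
            if (r[i].age > 50) hits++;
        return hits;
    }

    /* DSM / column store: each attribute in its own dense array. */
    struct person_cols { char (*name)[32]; int32_t *age; int32_t *zip; };

    size_t count_dsm(const struct person_cols *c, size_t n) {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++)        /* only the age column is touched */
            if (c->age[i] > 50) hits++;
        return hits;
    }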
  • 55. Main-Memory Tree Indexes  T-tree: balanced binary tree proposed in 1986 for MMDB • Aim: balance space overhead with search time.  Main-memory B+-trees: better cache performance [4]  Node width = cache line size (32-128B)  Minimizes the number of cache misses per node visit  But the tree becomes much taller than a traditional disk-based B+-tree => more cache misses overall  How can we make the B+-tree shallower? Invited talk @ ETRI, © Kyong-Ha Lee 55
  • 56. Cache-Sensitive B+-tree  Lay out child nodes contiguously  Eliminate all but one child pointer per node • keys in one node fit in one cache line • Removing pointers increases the fanout of the tree, which results in a reduced tree height • 35% faster tree lookups • Update performance is 30% worse (splits) (see the node-layout sketch below) Invited talk @ ETRI, © Kyong-Ha Lee 56
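A rough picture of why removing child pointers raises the fanout: in a 64-byte cache line, a node that keeps a single pointer to a contiguous group of children leaves room for a dozen 4-byte keys, whereas a node with one pointer per key would hold only about four. The struct below is a hypothetical sketch of such a node layout, not the exact layout from the CSB+-tree paper.

    #include <stdint.h>

    /* One cache line (64B): 12 keys + a single pointer to the contiguous child group. */
    struct csb_node {
        uint16_t nkeys;                 /* number of valid keys                       */
        uint16_t pad[3];                /* padding to keep the key array aligned      */
        int32_t  keys[12];              /* 12 x 4B = 48B of keys                      */
        struct csb_node *first_child;   /* children stored contiguously; child i is
                                           first_child + i                            */
    };

    _Static_assert(sizeof(struct csb_node) == 64, "node should fit one cache line");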
  • 57. Buffering Index Structures  buffering accesses to the index structure to avoid cache thrashing  Nodes in the index tree are grouped together into pieces that fit within the cache  Increase temporal locality but accesses can be delayed Invited talk @ ETRI, © Kyong-Ha Lee 57
  • 58. Prefetching B+-tree  Idea: Larger nodes + prefetching  Node size = multiple cache lines (e.g., 8 lines)  Prefetch all lines of a node before search it  Cost to access a node only increases slightly  Much shallower tree, no changes required  Improves both search and update performance Invited talk @ ETRI, © Kyong-Ha Lee 58
  • 59. Fractal pB+-tree  For faster range scan • Leaf parent nodes contain addresses of all leaves • Link leaf parent nodes together • Use this structure for prefetching leaf nodes  * A prefetching scheme that is appropriate for a uniprocessor may be entirely inappropriate for a multiprocessor [22]. Invited talk @ ETRI, © Kyong-Ha Lee 59
  • 60. Cache-Conscious Hash Join  For good temporal locality, the two relations to be joined are partitioned into partitions that fit in the data cache.  To reduce the TLB misses caused by a large number of partitions H, use a radix hash (radix clustering) • Within a cluster, the # of random accesses is low • a large number of clusters can be created by making multiple passes through the data (see the partitioning sketch below) Invited talk @ ETRI, © Kyong-Ha Lee 60
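The radix-clustering idea is to partition on only a few bits per pass, so that the number of partitions written concurrently stays below the TLB and cache limits. One such pass might look like the sketch below; the histogram-then-scatter structure is the standard one, but the choice of how many bits to use per pass is an illustrative assumption rather than a value from any specific system.

    #include <stdint.h>
    #include <stdlib.h>

    /* One radix-partitioning pass: scatter keys into 2^bits partitions by the
       low 'bits' bits, using a histogram to compute partition start offsets.
       Keeping 2^bits small (e.g., 6-8 bits) bounds the number of concurrently
       written output regions and hence the TLB misses. */
    void radix_partition(const uint32_t *in, uint32_t *out, size_t n, int bits) {
        size_t fanout = (size_t)1 << bits;
        uint32_t mask = (uint32_t)fanout - 1;
        size_t *hist = calloc(fanout, sizeof *hist);
        size_t *pos  = malloc(fanout * sizeof *pos);

        for (size_t i = 0; i < n; i++)            /* pass 1: histogram of partition sizes */
            hist[in[i] & mask]++;
        size_t off = 0;
        for (size_t p = 0; p < fanout; p++) {     /* prefix sum -> start offset per partition */
            pos[p] = off;
            off += hist[p];
        }
        for (size_t i = 0; i < n; i++)            /* pass 2: scatter keys to their partition */
            out[pos[in[i] & mask]++] = in[i];

        free(hist);
        free(pos);
    }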
  • 61. Group prefetching Invited talk @ ETRI, © Kyong-Ha Lee 61
  • 62. a group Invited talk @ ETRI, © Kyong-Ha Lee 62
  • 63. Buffering tuples btw. operators  group consecutive operators into execution groups whose operators fit into the L1 I-cache.  buffering output of the execution group  I-Cache misses are amortized over multiple tuples and i-cache thrashing is avoided Invited talk @ ETRI, © Kyong-Ha Lee 63
  • 64. How SMTs can help DB performance  Bi-threaded: partition input, cooperative threads  Work-ahead-set: main thread + helper thread • Main thread posts ―work-ahead set‖ to a queue • Helper thread issues load instructions for the requests Invited talk @ ETRI, © Kyong-Ha Lee 64
  • 65. Staged Database Execution Model  TX may be divided into stages that fit in the L1 I-cache  When one tx reaches the end of stage, system switches context to a different thread that needs to execute the same stage. Stage S0 LOAD X LOAD X STORE Y STORE Y STORE Y STORE Y Stage S1 LOAD Y LOAD Y …. …. STORE Z STORE Z LOAD Z …. Stage S2 LOAD Z …. Invited talk @ ETRI, © Kyong-Ha Lee 65
  • 66. Stage Spawning  (diagram: the transaction from the previous slide is split into stages S0, S1, and S2; each core has a work queue and runs instances of one stage: S0 on Core 0, S1 on Core 1, S2 on Core 2) Invited talk @ ETRI, © Kyong-Ha Lee 66
  • 67. Main-Memory Scan Sharing •Memory scan sharing also increases temporal locality •Too many sharing can cause cache thrashing Invited talk @ ETRI, © Kyong-Ha Lee 67
  • 68. Summary  Latency is a major problem  Cache-friendly programming is indispensable  Chip-level multiprocessors require software to expose TLP  Facilitating diverse computing sources is a challenge Invited talk @ ETRI, © Kyong-Ha Lee 68
  • 69. Further readings 1. Jim Gray, Gianfrano R. Putzolu, The 5 Minute Rule for Trading Memory for Disk Accesses and The 10 Byte Rule for Trading Memory for CPU Time, SIGMOD 1987: 395-398 2. David A. Patterson, Latency lags bandwidth, CACM, Vol. 47, No. 10 pp. 71—75, 2004 3. Mark Hill and et al., Amdahl‘s law in multicore era, IEEE Computer, Vol. 41, No. 7 pp. 33-38, 2008 4. J. Rao and et al., Cache Conscious Indexing for Decision-Support in Main Memory 5. P. A. Boncz and et al., Breaking the Memory wall in monetDB, CACM, Dec 2008 6. Shimin Chen and et al., Improving Hash Join Performance through Prefetching, ICDE 2004 7. Jingren Zhou and et al., Implementing Database Operations Using SIMD instructions, SIGMOD 2002 8. J. Cieslewicz and K.A. Ross, Database Optimizations for Modern Hardware, Proceedings of the IEEE 96(5), 2009 9. Lawrence Sparcklen and et al., Chip Multithreading: Opportunities and Challenges 10. Nikos Hardavellas and et al., Database Servers on Chip Multiprocessors: Limitations and Opportunities, CIDR 2007 11. Lin Qiao and et al., Main-Memory Scan Sharing For Multi-Core CPUs, PVLDB 2008 12. Ryan Johnson and et al., To Share or Not to Share?, VLDB 2007 13. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 2009 14. Database Architectures for New Hardware, Tutorial, in the 30th VLDB, 2004 and in the 21st ICDE 2005 15. Query Co-processing on Commodity Processors, Tutorial in the 22nd ICDE 2006. 16. John Nickolls and et al., GPU Computing Era, IEEE Micro March/April 2010 17. Kayvon Fatahalian and et al., A Closer Look at GPUs, CACM Vol. 51, No.10, 2008 Invited talk @ ETRI, © Kyong-Ha Lee 69
  • 70. 18. John Nickolls and et al., Scalable Parallel Programming, March/April ACM Queue, 2008 19. N.K. GOvindaraju and et al., GPUTeraSort: High performance graphics co-processor sorting for large database management, SIGMOD 2006 20. A. Mitra and et al., Boosting XML Filtering with a Scalable FPGA-based Architecture, CIDR 2009 21. S. Harizopoulos and A. Ailamaki and et al., Improving instruction cache performance in OLTP, ACM TODS, vol. 31, pp. 887-920 22. T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, 1991 *Courses available on Internet  Introduction to Computer Systems @CMU, 2000~2010 • http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f10/www/index.html  Multicore Programming Primer @MIT, 2007 (with video) • http://groups.csail.mit.edu/cag/ps3/index.shtml  Introduction to Multiprocessor Synchronization @Brown • http://www.cs.brown.edu/courses/cs176  Parallel Programming for Multicore @Berkeley, Spring 2007 • http://www.cs.berkeley.edu/~yelick/cs194f07/  Applications of Parallel Computing @Berkeley, Spring 2007 • http://www.cs.berkeley.edu/~yelick/cs267_sp07/  High-Performance Computing for Applications in Engineering @Wisc, Autumn 2008 • http://sbel.wisc.edu/Courses/ME964/2008/index.htm  High Performance Computing Training @Lawrence Livermore National Laboratory • https://computing.llnl.gov/?set=training&page=index  Programming Massively Parallel Processors with CUDA @Stanford, Spring 2010 (with video) • on Itunes U and Youtube.com Invited talk @ ETRI, © Kyong-Ha Lee 70