The Multicore Midlife Crisis
       Bogdan Marius Tudor

            CSTalks
         30 March 2011
Outline
•    The Memory Problem
•    Do We Need All These Cores?
•    Tomorrow’s Multicore
•    Research Perspective




5/4/11                             2
Remember Single Core?




[Image credit: Wikipedia]
My Next Processors
[Chart: Cache Size [kB] (y-axis, 0 to 4000) for each of the speaker’s processors (x-axis): 66 MHz (Apr-94), 200 MHz (Apr-98), 1000 MHz (Nov-01), 2250 MHz (May-04), 1600 MHz (Jul-06), 2400 MHz (Jul-08), 2400 MHz (Mar-11)]
So What?

Yep, they increased the cache size. Do I care?



The interesting part is why they did it.




The Memory Problem
•  Moore’s Law: the number of transistors doubles
   roughly every 18 months
         –  Single core: new transistors = faster speed
         –  Multicore: new transistors = more cores
•  Memory speed increase does not obey Moore’s Law!

[Diagram: a Processor box holding Core, Core, Core, Core and a Cache, with Memory below it]

The Memory Problem
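•  Problem: More cores compete for the same slow
   memory!
•  Implications:
         –  Normal pipeline (IF, ID, X, M, W): 5 cycles :)
         –  Pipeline stalled after IF, ID on an access to cache or RAM: > 100 cycles :(

[Diagram: the two pipelines side by side, with an “ID Queue” label beside the stalled one]
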
The Memory Problem
•  Problem: More cores compete for the same slow
   memory!
•  Solution: Increase cache size :)
         –  Maintain cache hit rate
            •  Halving the miss rate requires about 4x the cache size
            •  Exponential increase in the number of transistors needed
         –  Cache coherence overhead
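
The 4x figure on this slide matches the square-root rule of thumb for caches, under which the miss rate falls roughly as the inverse square root of the cache size. The sketch below assumes that rule, and also assumes that every core issues the same number of memory accesses; the function name and the `alpha` parameter are illustrative and do not come from the talk.

```python
def cache_growth_for_miss_reduction(miss_reduction: float, alpha: float = 0.5) -> float:
    """Factor by which the cache must grow to cut the miss rate by `miss_reduction`.

    Assumes miss_rate is proportional to cache_size ** (-alpha). The default
    alpha = 0.5 is the square-root rule of thumb, not a measured value.
    """
    if miss_reduction <= 0 or alpha <= 0:
        raise ValueError("miss_reduction and alpha must be positive")
    return miss_reduction ** (1.0 / alpha)


if __name__ == "__main__":
    for core_factor in (2, 4, 8):
        # With core_factor times more cores sharing the same memory, keeping
        # memory traffic flat means cutting the miss rate by core_factor.
        growth = cache_growth_for_miss_reduction(core_factor)
        print(f"{core_factor}x cores -> {growth:.0f}x cache")
```

Under these assumptions the cache has to grow with the square of the core count, which is why adding cache alone stops being practical.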



Increasing Cache Size



[Figure, annotated “Not practical!”]

B. M. Rogers et al. “Scaling the bandwidth wall: challenges in and avenues for CMP scaling.” ISCA 2009.

Other Approaches
•  Improve memory speed
         –  Slow, power-hungry and error-prone
•  Better caching
•  Improve memory bandwidth
         –  Latency tradeoff
•  Prefetch
         –  Mixed blessings
•  Allow more in-flight requests
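
Why more in-flight requests help can be sketched with Little's law: sustained bandwidth is roughly the number of bytes in flight divided by the memory latency. The 64-byte line size and 100 ns latency below are assumed example values, not figures from the talk, and the function name is illustrative.

```python
def sustained_bandwidth_gb_per_s(in_flight: int, line_bytes: int, latency_ns: float) -> float:
    """Little's law: bytes in flight / latency. Bytes per nanosecond equals GB/s."""
    if in_flight < 0 or line_bytes <= 0 or latency_ns <= 0:
        raise ValueError("in_flight must be >= 0; line_bytes and latency_ns must be > 0")
    return in_flight * line_bytes / latency_ns


if __name__ == "__main__":
    # Assumed example values: 64-byte cache lines, 100 ns memory latency.
    for requests in (1, 10, 40):
        bw = sustained_bandwidth_gb_per_s(requests, line_bytes=64, latency_ns=100.0)
        print(f"{requests:2d} outstanding requests -> {bw:.2f} GB/s")
```

With latency fixed, the only way to pull more data per second through the same memory is to keep more requests outstanding at once.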
Do We Need All These Cores?
•  Average utilization: < 20%
•  We don’t have many parallel apps
•  We already have enough compute power

•  Until you try to encode an HD video
         –  Star Trek holodecks: not there yet

•  CPU vendors still have to make a living

Tomorrow’s Multicore




[Image credit: Intel]
Tomorrow’s Multicore
•  Intel Core i3, i5, i7
         –  Video is integrated into the CPU
         –  Must balance sequential and parallel performance
         –  Lower energy requirements than previous generations
•  Heterogeneous cores
         –  Many slow cores that are good at floating point
         –  Some general-purpose cores
         –  “Combine” cores into super-cores
•  Must live with the memory problems
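
One way to see why sequential and parallel performance must be balanced is Amdahl's law. The talk does not name it; it is used here only as an illustration, and `amdahl_speedup` and its parameters are hypothetical names.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Even a mostly parallel program is soon limited by its sequential part.
    for cores in (2, 4, 8, 64):
        print(f"{cores:2d} cores, 90% parallel -> {amdahl_speedup(0.9, cores):.2f}x")
```

The sequential part caps the speedup no matter how many cores are added, which is the case for keeping a few fast general-purpose cores next to many slow ones.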
Tomorrow’s Multicore
•  The number of cores is becoming less
   important
         –  They can’t keep increasing them
         –  i3, i5, i7: how many cores each?




Tomorrow’s Multicore




[Image credit: Wikipedia]
Tomorrow’s Multicore
•  What matters is what the system provides
         –  FLOP intensive: GPU-style cores
         –  I/O intensive: FAWN (CMU)
         –  Memory intensive: Opteron/Xeon NUMA servers

A Research Perspective
•  Coping with heterogeneity is hard
         –  Different degrees of parallelism come with different
            sequential execution speeds
         –  Many tradeoffs: Speed vs. Energy vs. Memory
            intensity vs. I/O intensity
•  Need models for heterogeneity
         –  Understand the cost of the applications in terms
            of FLOPS, INTOPS, memory, I/O etc.
•  Silver lining: stick to sequential apps (?)

A Research Perspective
•  Coping with slow memory
•  Need to improve data locality by orders of
   magnitude
         •  Compiler support, auto-tuners, etc.
•  Space-efficient data types:
         •  Hot area in algorithms and systems
         •  Bloom filters: NSDI’10: 3 papers!
         •  Succinct data structures: STOC’08 to STOC’10
         •  Cache-oblivious algorithms
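
As an illustration of a space-efficient data type, here is a minimal Bloom filter sketch. It answers set-membership queries from a fixed number of bits, at the cost of occasional false positives. The class, the SHA-256 double-hashing scheme, and the sizes are illustrative choices, not taken from the talk or from the cited papers.

```python
import hashlib


class BloomFilter:
    """Approximate set: may report false positives, never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from two 64-bit halves of a single
        # SHA-256 digest (double hashing). This is an assumed, simple scheme.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter(num_bits=1024, num_hashes=3)
    seen.add("core-0")
    print("core-0" in seen)  # an added item is always reported as present
    print(len(seen.bits), "bytes of state")
```

The whole structure here is 128 bytes regardless of how long the stored keys are, so it stays cache-resident where a hash table of the keys would not.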

A Research Perspective
•  Software-helped cache coherence
         –  Or go without it :)
•  Renounce some programming patterns
            •  Java initializes all objects to some value…
            •  Rethink those hash tables
•  Go for approximate solutions
         –  It’s better if you can provide error bounds
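
A small example of an approximate solution that comes with an error bound: estimate a mean from a random sample instead of touching every element, and report a Hoeffding confidence radius. The function name, the defaults, and the choice of Hoeffding's inequality are assumptions made for illustration; the talk does not prescribe a method.

```python
import math
import random


def approximate_mean(values, sample_size, delta=0.05, lo=0.0, hi=1.0, seed=0):
    """Estimate the mean of `values` from `sample_size` random draws.

    Returns (estimate, eps). By Hoeffding's inequality,
    |estimate - true_mean| <= eps with probability at least 1 - delta,
    provided every value lies in [lo, hi].
    """
    if sample_size <= 0 or not 0.0 < delta < 1.0 or hi <= lo or not values:
        raise ValueError("invalid arguments")
    rng = random.Random(seed)
    # Independent draws with replacement, as the bound requires.
    sample = [rng.choice(values) for _ in range(sample_size)]
    estimate = sum(sample) / sample_size
    eps = (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * sample_size))
    return estimate, eps


if __name__ == "__main__":
    data = [i % 2 for i in range(1_000_000)]  # true mean is 0.5
    estimate, eps = approximate_mean(data, sample_size=10_000)
    print(f"mean is about {estimate:.3f} +/- {eps:.3f} (95% confidence)")
```

The sample reads 10,000 elements instead of a million, and the bound tells the caller how much accuracy was traded for the reduced memory traffic.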



Discussion


         Thank you for your attention




