Multi-core architectures
Single-core computer

[Diagram: a single-core computer]
Single-core CPU chip

[Diagram: a single-core CPU chip, with the single core highlighted]
Multi-core architectures
• This lecture is about a new trend in computer architecture:
  replicate multiple processor cores on a single die.

[Diagram: a multi-core CPU chip containing Core 1, Core 2, Core 3, Core 4]
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)

[Diagram: one chip containing core 1, core 2, core 3, core 4]
The cores run in parallel

[Diagram: thread 1, thread 2, thread 3, and thread 4, each running on its own core]
Within each core, threads are time-sliced
(just like on a uniprocessor)

[Diagram: several threads multiplexed on each of the four cores]
Interaction with the Operating System
• The OS perceives each core as a separate processor
• The OS scheduler maps threads/processes to different cores
• Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
Why multi-core?
• It is difficult to push single-core clock frequencies even higher
• Deeply pipelined circuits:
  – heat problems
  – speed-of-light problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into
  microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor
  speeds over the last 15 years
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• A server can serve each client in a separate thread (Web server,
  database server)
• A computer game can do AI, graphics, and physics in three separate
  threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution:
  explicitly exploiting TLP
General context: Multiprocessors
• A multiprocessor is any computer with several processors
• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data

[Photo: Lemieux cluster, Pittsburgh Supercomputing Center]
Multiprocessor memory types
• Shared memory:
  In this model, there is one (large) common shared memory for all
  processors
• Distributed memory:
  In this model, each processor has its own (small) local memory, and its
  contents are not replicated anywhere else
A multi-core processor is a special kind of multiprocessor:
all processors are on the same chip
• Multi-core processors are MIMD:
  different cores execute different threads (Multiple Instructions),
  operating on different parts of memory (Multiple Data)
• A multi-core chip is a shared-memory multiprocessor:
  all cores share the same memory
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to
  instruction-level parallelism)
(Each of these can run on its own core.)
More examples
• Editing a photo while recording a TV show through a digital video
  recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications are difficult to parallelize
A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled:
  – waiting for the result of a long floating-point (or integer) operation
  – waiting for data to arrive from memory
  Other execution units wait unused.

[Diagram: a processor pipeline: L1 D-Cache and D-TLB, integer and
floating-point units, schedulers, uop queues, rename/alloc, trace cache,
uCode ROM, decoder, BTB and I-TLB, L2 cache and control, bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the
  SAME core
• Weaves together multiple “threads” on the same core
• Example: if one thread is waiting for a floating-point operation to
  complete, another thread can use the integer units
Without SMT, only a single thread can run at any given time

[Diagram: the pipeline from the previous figure; only Thread 1 (floating
point) is active]
Without SMT, only a single thread can run at any given time

[Diagram: the same pipeline; only Thread 2 (integer operation) is active]
SMT processor: both threads can run concurrently

[Diagram: the same pipeline; Thread 1 (floating point) and Thread 2
(integer operation) are active at the same time, in different execution
units]
But: can’t simultaneously use the same functional unit

[Diagram: Thread 1 and Thread 2 both contending for the integer unit:
IMPOSSIBLE]

This scenario is impossible with SMT on a single core (assuming a single
integer unit).
SMT is not a “true” parallel processor
• Enables better threading (e.g. up to 30% performance improvement)
• The OS and applications perceive each simultaneous thread as a
  separate “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
Multi-core: threads can run on separate cores

[Diagram: two complete pipelines side by side; Thread 1 runs on the first
core, Thread 2 on the second]
Multi-core: threads can run on separate cores

[Diagram: the same two cores; Thread 3 runs on the first core, Thread 4
on the second]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT dual-core: all four threads can run concurrently

[Diagram: two SMT-enabled cores; Threads 1 and 3 share the first core,
Threads 2 and 4 share the second]
Comparison: multi-core vs SMT
• Advantages/disadvantages?
Comparison: multi-core vs SMT
• Multi-core:
  – Since there are several cores, each is smaller and not as powerful
    (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT:
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level parallelism
The memory hierarchy
• If simultaneous multithreading only: all caches are shared
• Multi-core chips:
  – L1 caches are private
  – L2 caches are private in some architectures and shared in others
• Memory is always shared
“Fish” machines
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches
• Shared L2 cache

[Diagram: CORE0 and CORE1, each with hyper-threads and a private L1
cache, sharing one L2 cache in front of memory]
Designs with private L2 caches

[Diagram, left: CORE0 and CORE1, each with private L1 and L2 caches,
connected to memory. Both L1 and L2 are private.
Examples: AMD Opteron, AMD Athlon, Intel Pentium D]

[Diagram, right: a design with L3 caches: each core has private L1, L2,
and L3 caches, connected to memory.
Example: Intel Itanium 2]
Private vs shared caches?
• Advantages/disadvantages?
Private vs shared caches
• Advantages of private:
  – They are closer to the core, so access is faster
  – Reduces contention
• Advantages of shared:
  – Threads on different cores can share the same cache data
  – More cache space is available if a single (or a few)
    high-performance thread runs on the system
The cache coherence problem
• Since we have private caches: how do we keep the data consistent
  across caches?
• Each core should perceive the memory as a monolithic array, shared by
  all the cores
The cache coherence problem
Suppose variable x initially contains 15213.

[Diagram: four cores, each with one or more levels of cache, on one
multi-core chip; main memory holds x=15213]
The cache coherence problem
Core 1 reads x.

[Diagram: Core 1’s cache now holds x=15213; main memory holds x=15213]
The cache coherence problem
Core 2 reads x.

[Diagram: the caches of Core 1 and Core 2 both hold x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660.

[Diagram: Core 1’s cache and main memory hold x=21660 (assuming
write-through caches), but Core 2’s cache still holds x=15213]
The cache coherence problem
Core 2 attempts to read x… and gets a stale copy.

[Diagram: Core 2 reads x=15213 from its own cache, even though main
memory holds x=21660]
Solutions for cache coherence
• This is a general problem with multiprocessors, not limited just to
  multi-core
• There exist many solution algorithms, coherence protocols, etc.
• A simple solution: an invalidation-based protocol with snooping
Inter-core bus

[Diagram: the four cores and their caches connected by an inter-core bus
on the multi-core chip, with main memory below]
Invalidation protocol with snooping
• Invalidation:
  if a core writes to a data item, all other copies of this data item in
  other caches are invalidated
• Snooping:
  all cores continuously “snoop” (monitor) the bus connecting the cores
The cache coherence problem
Revisited: Cores 1 and 2 have both read x.

[Diagram: the caches of Core 1 and Core 2 both hold x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660.

[Diagram: Core 1 sends an invalidation request over the inter-core bus;
its cache and main memory hold x=21660 (assuming write-through caches),
and Core 2’s copy is INVALIDATED]
The cache coherence problem
After invalidation:

[Diagram: only Core 1’s cache holds x=21660; Core 2 no longer has a copy]
The cache coherence problem
Core 2 reads x. It misses in its cache, and loads the new copy.

[Diagram: the caches of Core 1 and Core 2 both hold x=21660]
Alternative to the invalidation protocol: the update protocol
Core 1 writes x=21660:

[Diagram: Core 1 broadcasts the updated value over the inter-core bus;
Core 2’s copy is UPDATED to x=21660 (assuming write-through caches)]
Which do you think is better? Invalidation or update?
Invalidation vs update
• Multiple writes to the same location:
  – invalidation: bus traffic only on the first write
  – update: must broadcast each write (which includes the new variable
    value)
• Invalidation generally performs better: it generates less bus traffic
Invalidation protocols
• This was just the basic invalidation protocol
• More sophisticated protocols use extra cache state bits
• MSI, MESI (Modified, Exclusive, Shared, Invalid)
Programming for multi-core
• Programmers must use threads or processes
• Spread the workload across multiple cores
• Write parallel algorithms
• The OS will map threads/processes to cores
Thread safety is very important
• Pre-emptive context switching: a context switch can happen AT ANY TIME
• True concurrency, not just uniprocessor time-slicing
• Concurrency bugs are exposed much faster with multi-core
However: need to use synchronization
even if only time-slicing on a uniprocessor

int counter = 0;

void thread1() {
  int temp1 = counter;   /* read */
  counter = temp1 + 1;   /* write: the read-modify-write is not atomic */
}

void thread2() {
  int temp2 = counter;
  counter = temp2 + 1;
}
Need to use synchronization even if only
time-slicing on a uniprocessor

This schedule gives counter=2:
  temp1 = counter;
  counter = temp1 + 1;
  temp2 = counter;
  counter = temp2 + 1;

This schedule gives counter=1:
  temp1 = counter;
  temp2 = counter;
  counter = temp1 + 1;
  counter = temp2 + 1;
Assigning threads to the cores
• Each thread/process has an affinity mask
• The affinity mask specifies what cores the thread is allowed to run on
• Different threads can have different masks
• Affinities are inherited across fork()
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT

    core 3   core 2   core 1   core 0
       1        1        0        1

• The process/thread is allowed to run on cores 0, 2, and 3, but not on
  core 1
Affinity masks when multi-core and SMT are combined
• Separate bits for each simultaneous thread
• Example: 4-way multi-core, 2 threads per core

       core 3            core 2            core 1            core 0
  thread 1  thread 0  thread 1  thread 0  thread 1  thread 0  thread 1  thread 0
     1         1         0         0         1         0         1         1

• Core 2 can’t run the process
• Core 1 can only use one simultaneous thread
Default Affinities
• The default affinity mask is all 1s: all threads can run on all
  processors
• Then, the OS scheduler decides what threads run on what core
• The OS scheduler detects skewed workloads, migrating threads to less
  busy processors
Process migration is costly
• Need to restart the execution pipeline
• Cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible: it
  tends to keep a thread on the same core
• This is called soft affinity
Hard affinities
• The programmer can prescribe her own affinities (hard affinities)
• Rule of thumb: use the default scheduler unless you have a good reason
  not to
When to set your own affinities
• Two (or more) threads share data structures in memory:
  – map them to the same core so that they can share the cache
• Real-time threads. Example: a thread running a robot controller:
  – must not be context switched, or else the robot can go unstable
  – dedicate an entire core just to this thread

[Photo of a robot controller. Source: Sensable.com]
Kernel scheduler API
#include <sched.h>
int sched_getaffinity(pid_t pid,
  unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process ‘pid’ and stores it into
the space pointed to by ‘mask’.
‘len’ is the size of the mask in bytes: sizeof(unsigned long)
Kernel scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid,
  unsigned int len, unsigned long *mask);

Sets the current affinity mask of process ‘pid’ to *mask.
‘len’ is the size of the mask in bytes: sizeof(unsigned long)

To query the affinity of a running process:
[barbic@bonito ~]$ taskset -p 3935
pid 3935's current affinity mask: f
Windows Task Manager

[Screenshot: Windows Task Manager showing separate CPU usage graphs for
core 1 and core 2]
Legal licensing issues
• Will software vendors charge a separate license per core, or only a
  single license per chip?
• Microsoft, Red Hat Linux, and Suse Linux will license their OS per
  chip, not per core
Conclusion
• Multi-core chips are an important new trend in computer architecture
• Several new multi-core chips are in design phases
• Parallel programming techniques are likely to gain importance

multi-core Processor.ppt for IGCSE ICT and Computer Science Students
MKKhaing
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
Vajira Thambawita
 
Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architectures
Young Alista
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
shinolajla
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 HardwareJacob Wu
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
Haris456
 
Memory
MemoryMemory

Similar to Multi core-architecture (20)

Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
27 multicore
27 multicore27 multicore
27 multicore
 
27 multicore
27 multicore27 multicore
27 multicore
 
Multi-core architectures
Multi-core architecturesMulti-core architectures
Multi-core architectures
 
I3 multicore processor
I3 multicore processorI3 multicore processor
I3 multicore processor
 
I3
I3I3
I3
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU Computing
 
Multithreaded processors ppt
Multithreaded processors pptMultithreaded processors ppt
Multithreaded processors ppt
 
fundamentals of digital communication Unit 5_microprocessor.pdf
fundamentals of digital communication Unit 5_microprocessor.pdffundamentals of digital communication Unit 5_microprocessor.pdf
fundamentals of digital communication Unit 5_microprocessor.pdf
 
multi-core Processor.ppt for IGCSE ICT and Computer Science Students
multi-core Processor.ppt for IGCSE ICT and Computer Science Studentsmulti-core Processor.ppt for IGCSE ICT and Computer Science Students
multi-core Processor.ppt for IGCSE ICT and Computer Science Students
 
Gpgpu intro
Gpgpu introGpgpu intro
Gpgpu intro
 
Extlect04
Extlect04Extlect04
Extlect04
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
 
Motivation for multithreaded architectures
Motivation for multithreaded architecturesMotivation for multithreaded architectures
Motivation for multithreaded architectures
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 Hardware
 
Gpu archi
Gpu archiGpu archi
Gpu archi
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
 
4 threads
4 threads4 threads
4 threads
 
Memory
MemoryMemory
Memory
 

More from Piyush Mittal

Power mock
Power mockPower mock
Power mock
Piyush Mittal
 
Design pattern tutorial
Design pattern tutorialDesign pattern tutorial
Design pattern tutorialPiyush Mittal
 
Intro to parallel computing
Intro to parallel computingIntro to parallel computing
Intro to parallel computingPiyush Mittal
 
Cuda toolkit reference manual
Cuda toolkit reference manualCuda toolkit reference manual
Cuda toolkit reference manualPiyush Mittal
 
Matrix multiplication using CUDA
Matrix multiplication using CUDAMatrix multiplication using CUDA
Matrix multiplication using CUDAPiyush Mittal
 
Basics of Coding Theory
Basics of Coding TheoryBasics of Coding Theory
Basics of Coding TheoryPiyush Mittal
 
Google app engine cheat sheet
Google app engine cheat sheetGoogle app engine cheat sheet
Google app engine cheat sheetPiyush Mittal
 
oracle 9i cheat sheet
oracle 9i cheat sheetoracle 9i cheat sheet
oracle 9i cheat sheetPiyush Mittal
 
Open ssh cheet sheat
Open ssh cheet sheatOpen ssh cheet sheat
Open ssh cheet sheatPiyush Mittal
 

More from Piyush Mittal (20)

Power mock
Power mockPower mock
Power mock
 
Design pattern tutorial
Design pattern tutorialDesign pattern tutorial
Design pattern tutorial
 
Reflection
ReflectionReflection
Reflection
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Intel open mp
Intel open mpIntel open mp
Intel open mp
 
Intro to parallel computing
Intro to parallel computingIntro to parallel computing
Intro to parallel computing
 
Cuda toolkit reference manual
Cuda toolkit reference manualCuda toolkit reference manual
Cuda toolkit reference manual
 
Matrix multiplication using CUDA
Matrix multiplication using CUDAMatrix multiplication using CUDA
Matrix multiplication using CUDA
 
Channel coding
Channel codingChannel coding
Channel coding
 
Basics of Coding Theory
Basics of Coding TheoryBasics of Coding Theory
Basics of Coding Theory
 
Java cheat sheet
Java cheat sheetJava cheat sheet
Java cheat sheet
 
Google app engine cheat sheet
Google app engine cheat sheetGoogle app engine cheat sheet
Google app engine cheat sheet
 
Git cheat sheet
Git cheat sheetGit cheat sheet
Git cheat sheet
 
Vi cheat sheet
Vi cheat sheetVi cheat sheet
Vi cheat sheet
 
Css cheat sheet
Css cheat sheetCss cheat sheet
Css cheat sheet
 
Cpp cheat sheet
Cpp cheat sheetCpp cheat sheet
Cpp cheat sheet
 
Ubuntu cheat sheet
Ubuntu cheat sheetUbuntu cheat sheet
Ubuntu cheat sheet
 
Php cheat sheet
Php cheat sheetPhp cheat sheet
Php cheat sheet
 
oracle 9i cheat sheet
oracle 9i cheat sheetoracle 9i cheat sheet
oracle 9i cheat sheet
 
Open ssh cheet sheat
Open ssh cheet sheatOpen ssh cheet sheat
Open ssh cheet sheat
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 

Recently uploaded (20)

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 

  • 9. Why multi-core?
    – Difficult to make single-core clock frequencies even higher
    – Deeply pipelined circuits: heat problems; speed-of-light problems; difficult design and verification; large design teams necessary; server farms need expensive air-conditioning
    – Many new applications are multithreaded
    – General trend in computer architecture (shift towards more parallelism)
  • 10. Instruction-level parallelism
    – Parallelism at the machine-instruction level
    – The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
    – Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
  • 11. Thread-level parallelism (TLP)
    – This is parallelism on a coarser scale
    – A server can serve each client in a separate thread (Web server, database server)
    – A computer game can do AI, graphics, and physics in three separate threads
    – Single-core superscalar processors cannot fully exploit TLP
    – Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
  • 12. General context: Multiprocessors
    – A multiprocessor is any computer with several processors
    – SIMD: single instruction, multiple data (e.g. modern graphics cards)
    – MIMD: multiple instructions, multiple data
    (Photo: Lemieux cluster, Pittsburgh Supercomputing Center)
  • 13. Multiprocessor memory types
    – Shared memory: in this model, there is one (large) common shared memory for all processors
    – Distributed memory: in this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
  • 14. A multi-core processor is a special kind of multiprocessor: all processors are on the same chip
    – Multi-core processors are MIMD: different cores execute different threads (multiple instructions), operating on different parts of memory (multiple data)
    – Multi-core is a shared-memory multiprocessor: all cores share the same memory
  • 15. What applications benefit from multi-core?
    – Database servers
    – Web servers (Web commerce)
    – Compilers
    – Multimedia applications
    – Scientific applications, CAD/CAM
    (each can run on its own core)
    – In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
  • 16. More examples
    – Editing a photo while recording a TV show through a digital video recorder
    – Downloading software while running an anti-virus program
    – “Anything that can be threaded today will map efficiently to multi-core”
    – BUT: some applications are difficult to parallelize
  • 17. A technique complementary to multi-core: simultaneous multithreading
    – Problem addressed: the processor pipeline can get stalled:
      waiting for the result of a long floating-point (or integer) operation;
      waiting for data to arrive from memory
    – While the pipeline is stalled, the other execution units wait unused
    (Diagram: processor pipeline with L1 D-Cache and D-TLB, integer and floating-point units, L2 cache and control, schedulers, uop queues, rename/alloc, BTB, trace cache, uCode ROM, decoder, BTB and I-TLB. Source: Intel)
  • 18. Simultaneous multithreading (SMT)
    – Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
    – Weaves together multiple “threads” on the same core
    – Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units
  • 19. Without SMT, only a single thread can run at any given time (diagram: thread 1, a floating-point thread, occupies only the floating-point units)
  • 20. Without SMT, only a single thread can run at any given time (diagram: thread 2, an integer operation, occupies only the integer units)
  • 21. SMT processor: both threads can run concurrently (diagram: thread 1 uses the floating-point units while thread 2 uses the integer units)
  • 22. But: can’t simultaneously use the same functional unit. This scenario is impossible with SMT on a single core (assuming a single integer unit)
  • 23. SMT is not a “true” parallel processor
    – Enables better threading (e.g. up to 30%)
    – OS and applications perceive each simultaneous thread as a separate “virtual processor”
    – The chip has only a single copy of each resource
    – Compare to multi-core: each core has its own copy of resources
  • 24-25. Multi-core: threads can run on separate cores (diagrams: thread 1 and thread 2 each running on its own core; likewise threads 3 and 4)
  • 26. Combining multi-core and SMT
    – Cores can be SMT-enabled (or not)
    – The different combinations: single-core, non-SMT (standard uniprocessor); single-core with SMT; multi-core, non-SMT; multi-core with SMT (our fish machines)
    – The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
    – Intel calls them “hyper-threads”
  • 27. SMT dual-core: all four threads can run concurrently (diagram: two SMT cores, each running two of the four threads)
  • 28. Comparison: multi-core vs SMT. Advantages/disadvantages?
  • 29. Comparison: multi-core vs SMT
    – Multi-core: since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture); however, great with thread-level parallelism
    – SMT: can have one large and fast superscalar core; great performance on a single thread; mostly still only exploits instruction-level parallelism
  • 30. The memory hierarchy
    – If simultaneous multithreading only: all caches shared
    – Multi-core chips: L1 caches private; L2 caches private in some architectures and shared in others
    – Memory is always shared
  • 31. “Fish” machines
    – Dual-core Intel Xeon processors
    – Each core is hyper-threaded
    – Private L1 caches
    – Shared L2 cache
    (diagram: CORE0 and CORE1, each with its own L1 cache, above a shared L2 cache and memory)
  • 32. Designs with private L2 caches
    – Both L1 and L2 are private; examples: AMD Opteron, AMD Athlon, Intel Pentium D
    – A design with L3 caches: private L1 and L2, shared L3; example: Intel Itanium 2
  • 33. Private vs shared caches? Advantages/disadvantages?
  • 34. Private vs shared caches
    – Advantages of private: they are closer to the core, so faster access; reduces contention
    – Advantages of shared: threads on different cores can share the same cache data; more cache space available if a single (or a few) high-performance thread runs on the system
  • 35. The cache coherence problem
    – Since we have private caches: how to keep the data consistent across caches?
    – Each core should perceive the memory as a monolithic array, shared by all the cores
  • 36. The cache coherence problem: suppose variable x initially contains 15213 (diagram: four cores, each with one or more levels of cache, above shared main memory holding x=15213)
  • 37. Core 1 reads x: core 1’s cache now holds x=15213
  • 38. Core 2 reads x: core 2’s cache also holds x=15213
  • 39. Core 1 writes to x, setting it to 21660: core 1’s cache and main memory hold x=21660 (assuming write-through caches), but core 2’s cache still holds x=15213
  • 40. Core 2 attempts to read x… and gets a stale copy (x=15213), even though main memory holds x=21660
  • 41. Solutions for cache coherence
    – This is a general problem with multiprocessors, not limited just to multi-core
    – There exist many solution algorithms, coherence protocols, etc.
    – A simple solution: invalidation-based protocol with snooping
  • 42. Inter-core bus (diagram: the four cores’ caches are connected to main memory and to each other by an inter-core bus)
  • 43. Invalidation protocol with snooping
    – Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
    – Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores
  • 44. Revisited: cores 1 and 2 have both read x (both caches hold x=15213)
  • 45. Core 1 writes to x, setting it to 21660: core 1 sends an invalidation request on the inter-core bus, and core 2’s copy is INVALIDATED (assuming write-through caches, main memory also gets x=21660)
  • 46. After invalidation: only core 1’s cache and main memory hold x=21660
  • 47. Core 2 reads x: its cache misses, and it loads the new copy, x=21660
  • 48. Alternative to the invalidate protocol: update protocol. When core 1 writes x=21660, it broadcasts the updated value on the inter-core bus, and core 2’s copy is UPDATED to x=21660 (again assuming write-through caches)
  • 49. Which do you think is better? Invalidation or update?
  • 50. Invalidation vs update
    – Multiple writes to the same location: invalidation pays only the first time; update must broadcast each write (which includes the new variable value)
    – Invalidation generally performs better: it generates less bus traffic
  • 51. Invalidation protocols
    – This was just the basic invalidation protocol
    – More sophisticated protocols use extra cache state bits
    – MSI, MESI (Modified, Exclusive, Shared, Invalid)
  • 52. Programming for multi-core
    – Programmers must use threads or processes
    – Spread the workload across multiple cores
    – Write parallel algorithms
    – The OS will map threads/processes to cores
  • 53. Thread safety is very important
    – Pre-emptive context switching: a context switch can happen AT ANY TIME
    – True concurrency, not just uniprocessor time-slicing
    – Concurrency bugs are exposed much faster with multi-core
  • 54. However: need to use synchronization even if only time-slicing on a uniprocessor

    int counter = 0;

    void thread1() {
        int temp1 = counter;
        counter = temp1 + 1;
    }

    void thread2() {
        int temp2 = counter;
        counter = temp2 + 1;
    }

  • 55. Need to use synchronization even if only time-slicing on a uniprocessor
    – temp1=counter; counter=temp1+1; temp2=counter; counter=temp2+1 gives counter=2
    – temp1=counter; temp2=counter; counter=temp1+1; counter=temp2+1 gives counter=1
• 56. Assigning threads to the cores
  • Each thread/process has an affinity mask
  • The affinity mask specifies what cores the thread is allowed to run on
  • Different threads can have different masks
  • Affinities are inherited across fork()
• 57. Affinity masks are bit vectors
  • Example: 4-way multi-core, without SMT

          1        1        0        1
        core 3   core 2   core 1   core 0

  • Process/thread is allowed to run on cores 0, 2, 3, but not on core 1
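The bit-vector reading above amounts to a one-line bit test; the helper name below is made up for illustration:

```c
/* Bit i of the affinity mask is 1 iff the thread may run on core i.
   Helper name is illustrative, not a kernel API. */
int allowed_on_core(unsigned long mask, int core) {
    return (int)((mask >> core) & 1UL);
}
```

For the mask 1101 (0xD) from the slide, cores 0, 2, and 3 are allowed and core 1 is not.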
• 58. Affinity masks when multi-core and SMT combined
  • Separate bits for each simultaneous thread
  • Example: 4-way multi-core, 2 threads per core

        1 1      0 0      1 0      1 1
       core 3   core 2   core 1   core 0
      (bits per core: thread 1, thread 0)

  • Core 2 can't run the process
  • Core 1 can only use one simultaneous thread
• 59. Default Affinities
  • Default affinity mask is all 1s: all threads can run on all processors
  • Then, the OS scheduler decides what threads run on what core
  • OS scheduler detects skewed workloads, migrating threads to less busy processors
• 60. Process migration is costly
  • Need to restart the execution pipeline
  • Cached data is invalidated
  • The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
  • This is called soft affinity
• 61. Hard affinities
  • The programmer can prescribe her own affinities (hard affinities)
  • Rule of thumb: use the default scheduler unless there is a good reason not to
• 62. When to set your own affinities
  • Two (or more) threads share data structures in memory
    – map them to the same core so that they can share the cache
  • Real-time threads. Example: a thread running a robot controller (Source: Sensable.com)
    – must not be context switched, or else the robot can go unstable
    – dedicate an entire core just to this thread
• 63. Kernel scheduler API

      #include <sched.h>
      int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask);

  Retrieves the current affinity mask of process 'pid' and stores it into the space pointed to by 'mask'.
  'len' is the system word size: sizeof(unsigned long)
• 64. Kernel scheduler API

      #include <sched.h>
      int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

  Sets the current affinity mask of process 'pid' to *mask.
  'len' is the system word size: sizeof(unsigned long)

  To query the affinity of a running process:

      [barbic@bonito ~]$ taskset -p 3935
      pid 3935's current affinity mask: f
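On current Linux/glibc, these calls are usually made through the cpu_set_t wrappers rather than raw unsigned long masks. A hedged, Linux-specific sketch of setting a hard affinity (the helper name `pin_to_core` is made up):

```c
#define _GNU_SOURCE        /* for sched_setaffinity and CPU_* macros on glibc */
#include <sched.h>

/* Pin the calling process (pid 0 means "self") to a single core.
   Returns 0 on success, -1 on error with errno set. */
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);           /* clear the whole mask      */
    CPU_SET(core, &set);      /* set only bit `core`       */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

After a successful call, the scheduler will only ever run the process on that core, which is the kind of hard affinity slides 61-62 describe.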
• 65. Windows Task Manager [Screenshot: per-core CPU usage graphs for core 1 and core 2]
• 66. Legal licensing issues
  • Will software vendors charge a separate license per core, or only a single license per chip?
  • Microsoft, Red Hat Linux, Suse Linux will license their OS per chip, not per core
• 67. Conclusion
  • Multi-core chips are an important new trend in computer architecture
  • Several new multi-core chips are in design phases
  • Parallel programming techniques are likely to gain importance