Intel® Nehalem
Micro-Architecture
Mohammad Radpour
Amirali Sharifian
1
2
Outline
• Brief History
• Overview
• Memory System
• Core Architecture
• Hyper- Threading Technology
• Quick Path Interc...
3
Brief History
about Intel Processors
3
4
Nehalem System Example:
4
5
Building Blocks
5
6
How to make Silicon Die?
6
7
Overview of Nehalem
Processor Chip
• Four identical compute core
• UIU: Un-core interface unit
• L3 cache memory
and
dat...
8
Overview of Nehalem
Processor Chip(cont.)
• IMC : Integrated Memory Controller with 3 DDR3 memory channels
• QPI : Quick...
9
Overview of Nehalem
Processor Chip(cont.)
• Chip is divided into two domains:
“Un-core” and “core”
• “Core” components o...
10
Memory System
and Core Architecture
10
11
Nehalem Memory Hierarchy
Overview
11
12
Cache Hierarchy Latencies
• L1 32KB 8-way, Latency 4 cycles
• L2 256KB 8-way, Latency < 12 cycles
• L3 8MB shared , 16-...
13
Intel® Smart Cache – Level 3
13
14
Nehalem Microarchitecture
14
15
Instruction Execution
15
16
Instruction Execution (1/5)
1. Instructions fetched
from L2 cache
16
17
Instruction Execution (2/5)
1. Instructions fetched
from L2 cache
2. Instructions
decoded, prefetche
d and queued
17
18
Instruction Execution (3/5)
1. Instructions fetched
from L2 cache
2. Instructions
decoded, prefetche
d and queued
3. In...
19
Instruction Execution (4/5)
1. Instructions fetched
from L2 cache
2. Instructions
decoded, prefetche
d and queued
3. In...
20
Instruction Execution (5/5)
1. Instructions fetched
from L2 cache
2. Instructions
decoded, prefetched
and queued
3. Ins...
21
Caches and Memory
21
22
Caches and Memory (1/5)
1. 4-way set
associative
instruction cache
22
23
Caches and Memory (2/5)
1. 4-way set
associative
instruction cache
2. 8-way set
associative L1 data
cache (32 KB)
23
24
Caches and Memory (3/5)
1. 4-way set
associative
instruction cache
2. 8-way set
associative L1 data
cache (32 KB)
3. 8-...
25
Caches and Memory (4/5)
1. 4-way set associative
instruction cache
2. 8-way set associative
L1 data cache (32 KB)
3. 8-...
26
Caches and Memory (5/5)
1. 4-way set associative
instruction cache
2. 8-way set associative
L1 data cache (32 KB)
3. 8-...
27
Components
1. Instructions fetched
from L2 cache
2. Instructions
decoded, prefetched
and queued
3. Instructions
optimiz...
28
Components: Fetch
1. Instructions fetched
from L2 cache
– 32 KB instruction cache
– 2-level TLB
• L1
– Instructions:
7-...
29
Components: Decode
2. Instructions
decoded, prefetche
d and queued
– 16 byte prefetch buffer
– 18-op instruction queue
...
30
Components: Optimization
3. Instructions
optimized and
combined
– 4 op decoders
• Enables multiple
instructions per cyc...
31
Components: Execution
4. Instructions
executed
– 4 FPUs
• MUL, DIV, STOR, LD
– 3 ALUs
– 2 AGUs
• Address generation
– 3...
32
Components: Write-Back
5. Results written
– Private L1/L2 cache
32
33
Components: Write-Back
5. Results written
– Private L1/L2 cache
– Shared L3 cache
– QuickPath
• Dedicated channel to
an...
34
End
34
Upcoming SlideShare
Loading in...5
×

Nehalem (microarchitecture)

1,554

Published on

This is a presentation about Intel Nehalm Micro-architecture and the reference of the slides is Intel site

Published in: Education, Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,554
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
101
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Nehalem (microarchitecture)

  1. 1. Intel® Nehalem Micro-Architecture Mohammad Radpour Amirali Sharifian 1
  2. 2. 2 Outline • Brief History • Overview • Memory System • Core Architecture • Hyper- Threading Technology • Quick Path Interconnect (QPI) 2
  3. 3. 3 Brief History about Intel Processors 3
  4. 4. 4 Nehalem System Example: 4
  5. 5. 5 Building Blocks 5
  6. 6. 6 How to make Silicon Die? 6
  7. 7. 7 Overview of Nehalem Processor Chip • Four identical compute core • UIU: Un-core interface unit • L3 cache memory and data block memory 7
  8. 8. 8 Overview of Nehalem Processor Chip(cont.) • IMC : Integrated Memory Controller with 3 DDR3 memory channels • QPI : Quick Path Interconnect ports • Auxiliary circuitry for cache-coherence, power control, system management, performance monitoring 8
  9. 9. 9 Overview of Nehalem Processor Chip(cont.) • Chip is divided into two domains: “Un-core” and “core” • “Core” components operate with a same clock frequency of the actual Core • “Un-Core” components operate with different frequency. 9
  10. 10. 10 Memory System and Core Architecture 10
  11. 11. 11 Nehalem Memory Hierarchy Overview 11
  12. 12. 12 Cache Hierarchy Latencies • L1 32KB 8-way, Latency 4 cycles • L2 256KB 8-way, Latency < 12 cycles • L3 8MB shared , 16-way, Latency 30-40 cycles (4 core system) • L3 24MB shared, 24-way, Latency 30-60 cycles(8 core system) • DRAM , Latency ~ 180 – 200 cycles 12
  13. 13. 13 Intel® Smart Cache – Level 3 13
  14. 14. 14 Nehalem Microarchitecture 14
  15. 15. 15 Instruction Execution 15
  16. 16. 16 Instruction Execution (1/5) 1. Instructions fetched from L2 cache 16
  17. 17. 17 Instruction Execution (2/5) 1. Instructions fetched from L2 cache 2. Instructions decoded, prefetche d and queued 17
  18. 18. 18 Instruction Execution (3/5) 1. Instructions fetched from L2 cache 2. Instructions decoded, prefetche d and queued 3. Instructions optimized and combined 18
  19. 19. 19 Instruction Execution (4/5) 1. Instructions fetched from L2 cache 2. Instructions decoded, prefetche d and queued 3. Instructions optimized and combined 4. Instructions executed 19
  20. 20. 20 Instruction Execution (5/5) 1. Instructions fetched from L2 cache 2. Instructions decoded, prefetched and queued 3. Instructions optimized and combined 4. Instructions executed 5. Results written 20
  21. 21. 21 Caches and Memory 21
  22. 22. 22 Caches and Memory (1/5) 1. 4-way set associative instruction cache 22
  23. 23. 23 Caches and Memory (2/5) 1. 4-way set associative instruction cache 2. 8-way set associative L1 data cache (32 KB) 23
  24. 24. 24 Caches and Memory (3/5) 1. 4-way set associative instruction cache 2. 8-way set associative L1 data cache (32 KB) 3. 8-way set associative L2 data cache (256 KB) 24
  25. 25. 25 Caches and Memory (4/5) 1. 4-way set associative instruction cache 2. 8-way set associative L1 data cache (32 KB) 3. 8-way set associative L2 data cache (256 KB) 4. 16-way shared L3 cache (8 MB) 25
  26. 26. 26 Caches and Memory (5/5) 1. 4-way set associative instruction cache 2. 8-way set associative L1 data cache (32 KB) 3. 8-way set associative L2 data cache (256 KB) 4. 16-way shared L3 cache (8 MB) 5. 3 DDR3 memory connections 26
  27. 27. 27 Components 1. Instructions fetched from L2 cache 2. Instructions decoded, prefetched and queued 3. Instructions optimized and combined 4. Instructions executed 5. Results written 27
  28. 28. 28 Components: Fetch 1. Instructions fetched from L2 cache – 32 KB instruction cache – 2-level TLB • L1 – Instructions: 7-128 entries – Data: 32-64 entries • L2 – 512 data or instruction entries – Shared between SMT threads 28
  29. 29. 29 Components: Decode 2. Instructions decoded, prefetche d and queued – 16 byte prefetch buffer – 18-op instruction queue – MacroOp fusion • Combine small instructions into larger ones – Enhanced branch prediction 29
  30. 30. 30 Components: Optimization 3. Instructions optimized and combined – 4 op decoders • Enables multiple instructions per cycle – 28 MicroOp queue • Pre-fusion buffer – MicroOp fusion • Create 1 “instruction” from MicroOps – Reorder buffer • Post-fusion buffer 30
  31. 31. 31 Components: Execution 4. Instructions executed – 4 FPUs • MUL, DIV, STOR, LD – 3 ALUs – 2 AGUs • Address generation – 3 SSE Units • Supports SSE4 – 6 ports connecting the units 31
  32. 32. 32 Components: Write-Back 5. Results written – Private L1/L2 cache 32
  33. 33. 33 Components: Write-Back 5. Results written – Private L1/L2 cache – Shared L3 cache – QuickPath • Dedicated channel to another CPU, chip, or device • Replaces FSB 33
  34. 34. 34 End 34
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×