4. SPARC T1 overview
8 SPARC V9 cores.
4 threads per core.
16 KB L1 instruction cache (I-Cache) per core.
8 KB L1 data cache (D-Cache) per core.
5. SPARC T1 overview
3 MB L2 cache, shared by all cores.
4-way banked.
12-way associative.
132 GB/sec crossbar interconnect for on-chip communication.
6. SPARC T1 overview
4 DDR-II DRAM controllers.
144-bit interface per channel.
25 GB/sec total peak bandwidth.
IEEE 754 compliant floating point unit (FPU), shared by all cores.
7. SPARC T1 overview
External interface:
J-Bus interface for I/O.
2.56 GB/sec peak bandwidth.
128-bit multiplexed address and data bus.
Serial System Interface (SSI) for boot PROM.
10. SPARC core overview
Instruction Fetch Unit (IFU)
Load Store Unit (LSU)
Memory Management Unit (MMU)
Execution Unit (EXU)
Multiplier Unit (MUL)
Trap Logic Unit (TLU)
Floating Point Front-end Unit (FFU)
Stream Processing Unit (SPU)
11. SPARC core overview
SPARC core data path:
Separate instruction cache (I-Cache) and data cache (D-Cache).
12. SPARC core overview
We’ll limit our discussion to the I-Cache and D-Cache.
We’ll also include the associated TLB architecture for supporting memory virtualization.
13. L1 Cache and TLBs
IFU contains the I-Cache and I-TLB.
LSU contains the D-Cache and D-TLB.
[Diagram: IFU (I-Cache, I-TLB), LSU (D-Cache, D-TLB), and the MMU]
14. L1 Cache and TLBs
IFU controls the I-Cache content.
LSU controls the D-Cache content.
MMU controls both the I-TLB and D-TLB.
15. L1 I-Cache
Physically indexed and tagged: the address is translated into a physical address using the I-TLB before cache hit/miss is determined.
4-way set associative.
16 KB data storage with 32-byte line size.
Single-ported data and tag arrays.
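As a minimal sketch of how such a physically indexed, physically tagged lookup splits the address, the C fragment below derives the set index and tag from the geometry above (16 KB, 4 ways, 32-byte lines, hence 128 sets); the helper names are illustrative, not from the hardware.

```c
#include <stdint.h>

/* Geometry from the slides: 16 KB, 4-way, 32-byte lines -> 128 sets. */
#define ICACHE_LINE 32u
#define ICACHE_WAYS 4u
#define ICACHE_SIZE (16u * 1024u)
#define ICACHE_SETS (ICACHE_SIZE / (ICACHE_WAYS * ICACHE_LINE)) /* 128 */

/* Because the I-Cache is physically indexed and tagged, 'pa' is the
   physical address produced by the I-TLB lookup. */
static inline uint32_t icache_set(uint64_t pa)
{
    return (uint32_t)((pa / ICACHE_LINE) % ICACHE_SETS); /* PA bits [11:5] */
}

static inline uint64_t icache_tag(uint64_t pa)
{
    return pa / (ICACHE_LINE * ICACHE_SETS);             /* PA bits above 11 */
}
```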
16. L1 I-Cache
I-Cache fill size: 16 bytes per access.
Each cached entry contains:
a 32-bit instruction
1 parity bit
1 pre-decode bit
The valid-bit (V-bit) array has 1 read and 1 write port.
Cache invalidation accesses only the V-bit array.
17. L1 I-Cache
Cache line replacement is pseudo-random.
Read accesses have priority over write accesses to the I-Cache.
The maximum wait time for a write access is 25 SPARC core clock cycles.
A write access waiting more than 25 clock cycles causes a pipeline stall in order to allow the pending write operation to complete.
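A minimal sketch of this read-over-write priority with the 25-cycle cap, assuming a simple per-port arbiter; the function name and wait counter are hypothetical, not the actual fetch-pipeline logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define WRITE_WAIT_LIMIT 25 /* SPARC core clock cycles, from the slide */

typedef enum { GRANT_NONE, GRANT_READ, GRANT_WRITE } grant_t;

/* Hypothetical per-cycle arbitration between a read and a pending write. */
grant_t icache_port_arbitrate(bool read_req, bool write_req,
                              uint32_t *write_wait, bool *stall_pipeline)
{
    *stall_pipeline = false;
    if (write_req && *write_wait >= WRITE_WAIT_LIMIT) {
        *stall_pipeline = true;  /* stall fetch so the write can drain */
        *write_wait = 0;
        return GRANT_WRITE;
    }
    if (read_req) {              /* reads normally win the port */
        if (write_req)
            (*write_wait)++;     /* the write keeps aging */
        return GRANT_READ;
    }
    if (write_req) {
        *write_wait = 0;
        return GRANT_WRITE;
    }
    return GRANT_NONE;
}
```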
18. IFQ & MIL
The Instruction Fill Queue (IFQ) feeds into the I-Cache.
The Missed Instruction List (MIL) stores the addresses that missed in the I-Cache or I-TLB.
The MIL feeds into the LSU for further processing.
[Diagram: IFU - the fetch address goes to the I-TLB and I-Cache; the IFQ fills from the LSU; the MIL sends misses to the LSU]
19. Instruction fetch
In every SPARC core clock cycle, 2 instructions are fetched while 1 instruction is issued.
This strategy reduces I-Cache read accesses, leaving spare cycles for opportunistic I-Cache line fills.
Each thread is allowed to have one outstanding I-Cache miss,
i.e. a total of 4 I-Cache misses per core is allowed.
Duplicate I-Cache misses do not induce redundant fill requests to the L2-cache.
20. I-Cache fill path
The fill packet (CPX) comes from the L2-cache via the LSU.
Parity and pre-decode bits are computed before the I-Cache is filled.
CPX packets also include:
invalidations
test access port (TAP) reads & writes
error notifications
[Diagram: IFQ - the CPX packet from the LSU arbitrates with BIST and ASI accesses (priority BIST > ASI > CPX); outputs go to the V-bit array, the I-Cache, and a bypass to the TIR]
21. I-Cache fill path
Invalidation CPX packets are handled through the INV block, which accesses the V-bit array.
The IFQ has a bypass circuit to deliver the current CPX packet directly to the Thread Instruction Register (TIR), avoiding an extra stall in instruction processing.
22. I-Cache fill path
Each I-Cache fill takes 2 CPX packets, 16 bytes each.
The I-Cache line size is 32 bytes.
The I-Cache line is invalidated after the first packet is written.
The I-Cache line becomes valid again after the 2nd packet is written.
[Diagram: line state Valid -(write CPX-1)-> Invalid -(write CPX-2)-> Valid]
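A sketch of the two-packet fill sequence and its V-bit transitions, modelling one line in C; the names and types are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 32-byte I-Cache line filled by two 16-byte
   CPX packets; the valid bit is dropped between the two writes. */
typedef struct {
    uint8_t data[32];
    bool    valid;
} icache_line_t;

void icache_fill(icache_line_t *line,
                 const uint8_t cpx1[16], const uint8_t cpx2[16])
{
    line->valid = false;             /* invalidated as CPX-1 is written */
    memcpy(line->data, cpx1, 16);
    memcpy(line->data + 16, cpx2, 16);
    line->valid = true;              /* valid again after CPX-2 */
}
```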
23. I-Cache miss path
The Missed Instruction List (MIL) sends I-Cache miss requests to the L2-cache via the LSU.
One miss entry per thread, i.e. a total of 4 miss entries per SPARC core.
Each entry in the MIL contains:
the physical address (PA)
the replacement way information
the MIL state information
the cacheability
the error information
[Diagram: MIL - physical address (PA) entries feed a round-robin (RR) arbiter and comparator (COMP), producing PCX packets to the LSU]
24. I-Cache miss path
The PA entry keeps track of the I-Cache fetch progress from I-Cache miss until I-Cache fill.
A round-robin algorithm dispatches I-Cache fill requests from the different threads.
The MIL uses a linked list, of size 4, to keep track of duplicate I-Cache misses; see the sketch below.
It marks a duplicate request as a child.
Any child request is serviced as soon as the parent request gets its response.
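A minimal sketch of the parent/child duplicate-miss tracking, assuming a 4-entry MIL with one entry per thread; `mil_record_miss` and its fields are hypothetical stand-ins for the real structure.

```c
#include <stdbool.h>
#include <stdint.h>

#define THREADS 4 /* one MIL entry per thread */

/* Hypothetical MIL entry: a duplicate miss to the same line is linked to
   the first (parent) request instead of generating a second L2 fill. */
typedef struct {
    bool     valid;
    uint64_t pa;     /* physical address of the missed line */
    int      parent; /* -1 for a parent, else index of the parent entry */
} mil_entry_t;

static mil_entry_t mil[THREADS];

/* Record a miss for thread 'tid'; return true if a new fill request
   must be sent to the L2-cache. */
bool mil_record_miss(int tid, uint64_t pa, uint64_t line_mask)
{
    mil[tid].valid  = true;
    mil[tid].pa     = pa;
    mil[tid].parent = -1;
    for (int i = 0; i < THREADS; i++) {
        if (i != tid && mil[i].valid && mil[i].parent < 0 &&
            (mil[i].pa & line_mask) == (pa & line_mask)) {
            mil[tid].parent = i; /* duplicate: mark as child */
            return false;        /* no redundant L2 request */
        }
    }
    return true;                 /* parent: send fill request to L2 */
}
```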
25. I-Cache miss path
[State diagram: S1 -(make fill request)-> S2; S2 loops while CPX-1 is not done; S2 -(send speculative notification)-> S3; S3 loops while CPX-2 is not done; S3 -(send notification)-> S4; a new I-Cache miss starts at S1]
The MIL alternates between 4 states.
It starts with S1 upon a new I-Cache miss.
It makes the fill request.
It waits until the I-Cache fill is done.
Upon completing the CPX-1 fill, it sends a speculative completion notification to the thread scheduler.
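One plausible rendering of the four MIL states as a C state machine; the mapping of S1-S4 to these conditions follows the diagram above and is an interpretation, not the actual RTL.

```c
#include <stdbool.h>

/* Hypothetical encoding of the four MIL states described above. */
typedef enum {
    MIL_S1_NEW_MISS,  /* new I-Cache miss recorded                       */
    MIL_S2_FILL_REQ,  /* fill request sent, waiting on CPX-1             */
    MIL_S3_CPX1_DONE, /* speculative notification sent, waiting on CPX-2 */
    MIL_S4_DONE       /* final notification sent                         */
} mil_state_t;

mil_state_t mil_step(mil_state_t s, bool cpx1_done, bool cpx2_done)
{
    switch (s) {
    case MIL_S1_NEW_MISS:  return MIL_S2_FILL_REQ;  /* make fill request */
    case MIL_S2_FILL_REQ:  return cpx1_done ? MIL_S3_CPX1_DONE : s;
    case MIL_S3_CPX1_DONE: return cpx2_done ? MIL_S4_DONE : s;
    default:               return MIL_S4_DONE;
    }
}
```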
26. I-Cache miss path
An I-Cache fill request may be cancelled upon a trap or exception.
However, the MIL still goes through filling the cache line, but the bypass to the TIR is blocked.
Why? Because the pending child request should be serviced even if the parent request is cancelled.
A child I-Cache miss request needs to wait until the parent's I-Cache miss request is serviced. The child instruction fetch is rolled back to the fetch stage to allow it to access the I-Cache. This is referred to as 'miss-fill crossover'.
27. L1 D-Cache
4-way set associative.
8 KB data storage with 16-byte line size.
Single read-write port for the data and tag arrays.
Dual-ported valid-bit (V-bit) array.
Cache invalidation accesses only this V-bit array.
28. L1 D-Cache
The cache line replacement policy is pseudo-random, using a linear-feedback shift register, with allocating load misses but non-allocating store misses; see the sketch below.
A cacheable load miss allocates a line; the write-through policy is executed before the line is loaded.
Stores do not allocate. Hence, a store causes a line invalidation if the target address is already in the D-Cache, as determined by the L2 cache directory.
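A sketch of pseudo-random way selection, with a small Galois LFSR standing in for the linear-feedback shift register mentioned above; the tap mask and seed are illustrative, not the hardware's polynomial.

```c
#include <stdint.h>

/* 8-bit Galois LFSR; 0xB8 is a maximal-length tap mask. */
static uint8_t lfsr = 0xACu; /* any nonzero seed works */

static uint8_t lfsr_step(void)
{
    lfsr = (uint8_t)((lfsr >> 1) ^ ((lfsr & 1u) ? 0xB8u : 0u));
    return lfsr;
}

/* Pick a victim among the 4 ways of the D-Cache pseudo-randomly. */
int dcache_pick_victim_way(void)
{
    return lfsr_step() & 0x3; /* 2 low bits select one of 4 ways */
}
```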
29. L1 D-Cache
The L1 D-Cache is always inclusive with respect to the L2 cache.
The L1 D-Cache is always exclusive with respect to the L1 I-Cache.
The L1 D-Cache is parity protected.
A parity error causes a D-Cache miss, so the data gets corrected.
In addition to pipeline reads, the L1 D-Cache may be accessed by ASI, BIST, and RAM-test through the test access port (TAP).
30. Data flow through LSU
One store buffer (STB) per thread.
Load misses are kept in the Load Miss Queue (LMQ).
One outstanding load miss per thread.
A load miss with a duplicate physical address (PA) is not sent to the L2-cache.
Fully associative D-TLB.
All CAM/RAM accesses are single-cycle operations.
[Diagram: LSU data path - the STB, LMQ, and other sources feed the PCX generator toward the PCX; incoming CPX packets land in the DFQ; load data returns to the IRF/FRF]
31. Data flow through LSU
The STB consists of a store buffer CAM (SCM) and a store data array (STBDATA).
The SCM has 1 CAM port and 1 RW port.
STBDATA has 1 read and 1 write port.
Each thread is allocated 8 fixed entries in the shared data structure.
32. Data flow through LSU
A load instruction speculates on a D-Cache miss to reduce the CCX access latency.
If the speculation fails, the load instruction is taken out of the LMQ.
The arbiter (PCX generator) takes 13 different inputs to generate the packet for the PCX (processor-to-crossbar) interface.
33. Data flow through LSU
The arbiter inputs consist of:
4 load-type instructions
4 store-type instructions
one I-Cache fill
one FPU access
one SPU access
one interrupt
one forward packet
34. Data flow through LSU
The arbitration inputs are grouped into priority classes:
I-Cache misses
load misses
stores
{FPU operations, SPU operations, interrupts}
A two-level history mechanism implements fair scheduling among the different priority levels; see the sketch below.
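A minimal sketch of one plausible two-level round-robin history scheme, with per-class thread pointers (level 1) and a class pointer (level 2); the real PCX generator also enforces the priorities listed above, which this sketch omits for brevity.

```c
#include <stdbool.h>

enum { CLS_IFILL, CLS_LOAD, CLS_STORE, CLS_MISC, NUM_CLS };
#define THREADS 4

static int cls_ptr;          /* level-2 history: last class served  */
static int thr_ptr[NUM_CLS]; /* level-1 history: last thread served */

/* req[c][t] is true when thread t has a pending request of class c.
   Returns the winning class (or -1), and the winning thread. */
int pcx_arbitrate(bool req[NUM_CLS][THREADS], int *thread_out)
{
    for (int i = 1; i <= NUM_CLS; i++) {
        int c = (cls_ptr + i) % NUM_CLS;
        for (int j = 1; j <= THREADS; j++) {
            int t = (thr_ptr[c] + j) % THREADS;
            if (req[c][t]) {
                cls_ptr = c;     /* update both history levels */
                thr_ptr[c] = t;
                *thread_out = t;
                return c;
            }
        }
    }
    return -1; /* nothing pending */
}
```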
35. Data flow through LSU
Incoming packets are stored in the Data Fill Queue (DFQ).
Packets can be:
acknowledgments
data
The targets for the DFQ are:
the Instruction Fetch Unit (IFU)
the Load Store Unit (LSU)
the Trap Logic Unit (TLU)
the Stream Processing Unit (SPU)
36. Memory Management Unit
Maintains the content of the I-TLB and D-TLB.
The MMU helps the SPARC T1 provide support for virtualization.
Multiple OSes co-exist on top of the CMT processor.
A hypervisor layer virtualizes the underlying CPU.
A virtual address (VA) from an application is translated into a real address (RA) and then into a physical address (PA) using the TLB & MMU.
37. Data Flow in MMU
The system software maintains the content of the TLBs by sending instructions to the MMU.
The instructions are: read, write, and de-map.
TLB entries are shared among threads.
Consistency among TLB entries is maintained through auto-de-map.
The MMU is responsible for:
generating the pointers to the software Translation Storage Buffer (TSB)
maintaining the fault status for various traps
Access to the MMU is through hypervisor-managed ASI (Alternate Space Identifier) operations, e.g. ldxa, stxa.
39. TLB structure
The TLB consists of a Content Addressable Memory (CAM) and a Random Access Memory (RAM).
The CAM has 1 compare port and 1 read-write port.
The RAM has 1 read-write port.
The TLB supports the mutually exclusive events: CAM (lookup), read, write, bypass, de-map, soft reset, and hard reset.
40. TLB structure
The RAM contains the following fields:
Physical Address (PA)
attributes
The CAM contains the following fields:
Partition ID (PID)
Real bit (indicates VA-to-PA or RA-to-PA translation)
Virtual Address (VA), divided into page-size based fields (V0 - V3)
Context ID (CTXT)
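A sketch of one CAM entry's compare, assuming the fields above; the page-size-based VA fields (V0 - V3) are collapsed into a single page number for brevity, so the struct and match rule are illustrative rather than exact.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CAM-side fields of one of the 64 TLB entries. */
typedef struct {
    bool     valid;
    uint16_t pid;  /* partition ID                                      */
    bool     real; /* set: RA-to-PA entry (context ignored); clear: VA-to-PA */
    uint64_t vpn;  /* virtual page number (V0-V3 collapsed here)        */
    uint16_t ctxt; /* context ID                                        */
} tlb_cam_t;

/* One entry's compare: every relevant field must match for a CAM hit. */
bool tlb_cam_match(const tlb_cam_t *e, uint16_t pid, bool real_access,
                   uint64_t vpn, uint16_t ctxt)
{
    if (!e->valid || e->pid != pid || e->real != real_access)
        return false;
    if (e->vpn != vpn)
        return false;
    return e->real || e->ctxt == ctxt; /* real entries skip the context */
}
```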
41. TLB entry replacement algorithm
Each entry has a used bit.
The replacement victim is picked by the least significant unused bit among all 64 entries.
A used bit is set on a write, a CAM hit, or a lock.
A locked page always has its used bit set.
Entry invalidation clears the used bit.
All used bits, except those of locked entries, are cleared if the TLB reaches saturation.
If the TLB is saturated with all entries locked, default entry 63 is chosen and an error is reported.
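A sketch of this used-bit policy over a 64-entry TLB, with the entries modelled as bits of a 64-bit mask and the slide's default location read as entry 63; the representation is a modelling choice.

```c
#include <stdint.h>

/* Bit i of 'used'/'locked' corresponds to TLB entry i (64 entries). */
int tlb_pick_replacement(uint64_t *used, uint64_t locked)
{
    uint64_t candidates = ~(*used);  /* entries whose used bit is clear */
    if (candidates == 0) {
        *used = locked;              /* saturation: clear all used bits,
                                        except for locked entries        */
        candidates = ~locked;
        if (candidates == 0)
            return 63;               /* all locked: default entry 63,
                                        and an error is reported         */
    }
    /* least significant unused bit picks the victim; the subsequent
       write will set its used bit (GCC/Clang builtin) */
    return __builtin_ctzll(candidates);
}
```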
42. L2-cache overview
3 MB in total size with four symmetrical data banks.
Each bank operates independently.
Each bank is 12-way set associative and 768 KB in size.
The line size is 64 bytes.
The number of sets per bank is 1024.
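A sketch of how such an address could decompose, assuming bank interleave on PA[7:6] and a 40-bit physical address; these bit positions are an assumption, though they are consistent with the 22-bit tag described on the later tag slide.

```c
#include <stdint.h>

/* Geometry from the slides: 4 banks x 1024 sets x 12 ways x 64-byte
   lines = 3 MB. */
typedef struct {
    unsigned bank;
    unsigned set;
    uint64_t tag;
} l2_addr_t;

l2_addr_t l2_decompose(uint64_t pa)
{
    l2_addr_t a;
    a.bank = (pa >> 6) & 0x3;       /* PA[7:6]  : bank select (assumed) */
    a.set  = (pa >> 8) & 0x3FF;     /* PA[17:8] : 1024 sets per bank    */
    a.tag  = (pa >> 18) & 0x3FFFFF; /* PA[39:18]: 22-bit tag            */
    return a;
}
```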
43. L2-cache overview
Accepts requests from the processor-to-crossbar (PCX) interface - a part of the CCX.
Puts responses on the crossbar-to-processor (CPX) interface - a part of the CCX.
Responsible for maintaining on-chip coherency across all L1 caches.
Keeps a copy of all L1 tags in a directory structure.
44. L2-cache overview
128-bit fill interface.
64-bit write interface with the DRAM controller.
Each bank has a dedicated DRAM controller.
8-stage pipelined cache controller.
45. L2-cache overview
Each 32-bit word is protected by a 7-bit single error correction, double error detection (SEC/DED) ECC code.
Interfaces with the J-Bus interface (JBI) through a snoop input queue and an RDMA write buffer.
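A generic (39,32) Hamming SEC/DED encoder matching the 7-bit count above (6 Hamming check bits plus 1 overall parity bit); the actual T1 check-bit equations are not given on the slides, so this is only representative.

```c
#include <stdint.h>

/* Encode 32 data bits into 7 ECC bits: 6 Hamming check bits plus one
   overall parity bit for double-error detection. */
uint8_t secded_encode(uint32_t data)
{
    uint64_t word = 0; /* codeword bit positions 1..38 (bit 0 unused) */
    int d = 0;

    /* scatter data bits into the non-power-of-two positions */
    for (int pos = 1; pos <= 38; pos++) {
        if ((pos & (pos - 1)) == 0)
            continue;                /* power-of-two: check-bit slot */
        if ((data >> d++) & 1u)
            word |= 1ull << pos;
    }

    /* check bit p covers every position whose index has bit p set */
    uint8_t check = 0;
    for (int p = 0; p < 6; p++) {
        unsigned parity = 0;
        for (int pos = 1; pos <= 38; pos++)
            if ((pos >> p) & 1)
                parity ^= (unsigned)(word >> pos) & 1u;
        check |= (uint8_t)(parity << p);
    }

    /* overall parity over data + check bits (GCC/Clang builtin) */
    unsigned overall = __builtin_parity(data) ^ __builtin_parity(check);
    return (uint8_t)(check | (overall << 6)); /* 7 ECC bits */
}
```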
46. L2-Cache structure
3 main components:
SCTAG (Secondary Cache TAG): contains the tag array, VUAD array, L2 tag directory, and cache controller.
SCBUF (Secondary Cache BUF): contains the write back buffer (WBB), fill buffer (FB), and DMA buffer.
SCDATA (Secondary Cache DATA): contains the L2-cache data.
47. L2-cache : Arbiter
Manages L2-cache pipeline access among the various sources of requests.
The arbiter gets input from:
instructions from the CCX, via the input queue (IQ) or its bypass path
DMA instructions from the snoop input queue
instructions recycled from the miss buffer (MB) and fill buffer (FB)
the stall signal from the pipeline
48. L2-cache : TAG
22-bit tag with 6 bits of SEC ECC protection.
No double-bit error detection.
Single-ported array.
Four states are maintained per tag line in the VUAD array:
Valid (V)
Used (U)
Allocated (A)
Dirty (D)
49. L2-cache : VUAD
Dual-ported array structure.
The V, A, and D bits are parity protected, since an error in them would be fatal.
The U bit is not protected, since an error in it is not fatal.
The VUAD array is accessed when making a line replacement decision.
50. L2-cache : DATA
Single-ported SRAM structure.
768 KB per bank with a 64-byte logical line size.
Allows read accesses of 16 bytes and 64 bytes.
16-byte writes with enables allow writing in 4-byte units.
Line fills update all 64 bytes at a time.
51. L2-cache : DATA
The data array is subdivided into 4 columns, with six 32 KB sub-arrays in each column.
A data array access needs 2 cycles to complete.
No column can be accessed in consecutive cycles.
All accesses are pipelined, so accesses have a throughput of one per cycle.
Each 32-bit word is protected by 7 bits of SEC/DED ECC.
52. L2-cache : Input Queue (IQ)
A 16-entry FIFO queue takes incoming PCX packets.
Each entry is 130 bits wide.
The FIFO is implemented with a dual-ported array.
The IQ asserts a stall when 11 entries are filled, to leave room for incoming packets already in flight.
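A sketch of that early-stall behaviour: the stall is asserted at 11 of 16 entries so packets already in flight on the PCX still have room to land. The structure and names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define IQ_DEPTH 16
#define IQ_STALL 11 /* early-stall threshold from the slide */

typedef struct {
    uint64_t entry[IQ_DEPTH][3]; /* 130-bit packets, modelled as 3x64 */
    unsigned head, tail, count;
    bool     stall;              /* back-pressure toward the PCX */
} iq_t;

bool iq_push(iq_t *q, const uint64_t pkt[3])
{
    if (q->count == IQ_DEPTH)
        return false;                      /* truly full: drop/retry */
    for (int i = 0; i < 3; i++)
        q->entry[q->tail][i] = pkt[i];
    q->tail = (q->tail + 1) % IQ_DEPTH;
    q->count++;
    q->stall = (q->count >= IQ_STALL);     /* assert the early stall */
    return true;
}
```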
53. L2-cache : Output Queue (OQ)
A 16-entry FIFO for the packets waiting to get access to the CPX.
Each entry is 146 bits wide.
The FIFO is implemented with a dual-ported array.
When the OQ reaches its high-water mark, the L2-cache stops accepting PCX packets and input from the miss buffer.
Fills can still happen, since they do not generate CPX packets.
54. L2-cache : Miss Buffer (MB)
The 16-entry miss buffer stores instructions which cannot be processed as a simple cache hit:
a true L2-cache miss
a miss to the same cache line address as an earlier miss
a match against an entry in the write back buffer
instructions needing multiple passes through the L2-cache pipeline
unallocated L2-cache misses
accesses causing a tag ECC error
The non-tag part holds data - it is a RAM with 1 read and 1 write port.
The tag part holds addresses - it is a CAM with 1 read, 1 write, and 1 CAM port.
55. L2-cache : Fill Buffer (FB)
An 8-entry buffer.
Contains cache-line-wide entries to stage data from DRAM before it fills the cache.
A RAM structure is used for this part.
The address is also stored, to maintain the age ordering needed for data coherence.
A CAM structure is used for this part.
Data arrives from DRAM in four 16-byte blocks, starting with the critical quad-word.
56. L2-cache : Write Back Buffer (WBB)
An 8-entry buffer, used to store 64-byte dirty lines upon eviction.
The evicted lines are streamed to DRAM opportunistically.
If an instruction has the same line address as a WBB entry, the instruction is pushed back into the MB.
The WBB also has RAM and CAM parts to hold data and addresses, respectively.
64-byte read interface with the data array and 64-bit write interface to the DRAM controller.
57. L2-cache : Directory
2048 entries, with one entry per L1 tag.
It is an L1-tag to L2-bank mapping.
Half of the entries are for the L1 I-Cache and the other half for the L1 D-Cache:
an I-cache directory and a D-cache directory.
Participates in coherency management.
Also ensures the same line is not present in both the I-Cache and the D-Cache.
58. L2-cache : Line Replacement Algorithm
Uses pseudo-LRU for line replacement.
The ‘U’ bit (12 in total, 1 per way) is set upon a cache hit.
All 12 ‘U’ bits are cleared when no way is left unused or unallocated.
The ‘A’ bit means the line is allocated for a miss.
It is analogous to a ‘lock’ bit.
The ‘A’ bit is cleared when the line fill happens.
59. L2-cache : Line Replacement Algorithm
The ‘D’ bit indicates the line is valid only inside the cache and is required to be written back.
It is set when data is written to the L2-cache.
It is cleared when the line is invalidated.
The LRU logic examines all the ways from a starting point chosen in a round-robin fashion; see the sketch below.
The first unused and unallocated line is allocated for the miss.
If no way is unused, the first unallocated line is allocated for the miss.
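A sketch of the replacement scan described above: starting from a rotating point, take the first way that is neither used nor allocated, else the first unallocated way; `rr_start` and the array-of-flags representation are modelling choices, not the hardware structure.

```c
#include <stdbool.h>

#define L2_WAYS 12

static int rr_start; /* rotating scan start (assumed per-bank or per-set) */

/* Returns the victim way, or -1 if every way is allocated. */
int l2_pick_victim(const bool used[L2_WAYS], const bool alloc[L2_WAYS])
{
    int first_unalloc = -1;
    for (int i = 0; i < L2_WAYS; i++) {
        int w = (rr_start + i) % L2_WAYS;
        if (alloc[w])
            continue;                      /* 'A' acts like a lock bit */
        if (first_unalloc < 0)
            first_unalloc = w;
        if (!used[w]) {                    /* unused and unallocated wins */
            rr_start = (w + 1) % L2_WAYS;
            return w;
        }
    }
    if (first_unalloc >= 0)                /* fall back: first unallocated */
        rr_start = (first_unalloc + 1) % L2_WAYS;
    return first_unalloc;
}
```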
60. Scope of future study
Cache crossbar (CCX) data transactions.
L2-cache pipelined data flow control.
Cache memory consistency and instruction ordering.