Operating Systems - Architecture
1. Operating Systems
CMPSCI 377
Architecture
Emery Berger
University of Massachusetts Amherst
2. Architecture
Hardware Support for Applications & OS
Architecture basics & details
Focus on characteristics exposed to
application programmer / OS
3. The Memory Hierarchy
Registers
Caches
Associativity
Misses
Locality
4. Registers
Register = dedicated name for one word of memory managed by CPU
General-purpose: “AX”, “BX”, “CX” on x86
Special-purpose:
“SP” = stack pointer
“FP” = frame pointer
“PC” = program counter
[Figure: stack frame holding arguments (arg0, arg1, arg2); SP points to the top of the stack, FP to the base of the current frame]
5. Registers
Register = dedicated name for one word of memory managed by CPU
General-purpose: “AX”, “BX”, “CX” on x86
Special-purpose:
“SP” = stack pointer
“FP” = frame pointer
“PC” = program counter
Change processes: save current registers & load saved registers = context switch
[Figure: same stack diagram, with SP and FP marking the current frame]
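A context switch can be sketched in C. This is a minimal illustration, not kernel code: the regs_t layout below is hypothetical, and real switches are written in assembly, since compiled C itself needs registers to run.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical register file for one process (x86-like names). */
    typedef struct {
        uint64_t ax, bx, cx;   /* general-purpose */
        uint64_t sp, fp, pc;   /* stack pointer, frame pointer, program counter */
    } regs_t;

    /* Context switch: save the outgoing process's registers, then load
       the incoming process's saved registers into the CPU. */
    void context_switch(regs_t *cpu, regs_t *outgoing, regs_t *incoming) {
        memcpy(outgoing, cpu, sizeof(regs_t));  /* save current registers */
        memcpy(cpu, incoming, sizeof(regs_t));  /* load saved registers   */
    }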
6. Caches
Access to main memory: “expensive”
~100 cycles (slow, but relatively cheap per byte)
Caches: small, fast, expensive memory
Hold recently-accessed data (D$) or
instructions (I$)
Different sizes & locations
Level 1 (L1) – on-chip, smallish
Level 2 (L2) – on or next to chip, larger
Level 3 (L3) – pretty large, on bus
Manages lines of memory (32-128 bytes)
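Line granularity has a visible cost model: touching one byte loads the whole line, so consecutive accesses are nearly free while line-sized strides pay a miss almost every time. A C sketch, assuming 64-byte lines (the actual size is machine-specific):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16 MB array */
    #define LINE 64       /* assumed cache line size in bytes */

    int main(void) {
        unsigned char *a = malloc(N);
        volatile long sum = 0;
        clock_t t;

        for (long i = 0; i < N; i++) a[i] = (unsigned char)i; /* touch real pages */

        /* Stride 1: one miss loads a line; the next LINE-1 accesses hit. */
        t = clock();
        for (long i = 0; i < N; i++) sum += a[i];
        printf("stride 1:  %.3f s for %d accesses\n",
               (double)(clock() - t) / CLOCKS_PER_SEC, N);

        /* Stride LINE: every access lands on a fresh line, so nearly every
           access misses; 64x fewer accesses, but not 64x faster. */
        t = clock();
        for (long i = 0; i < N; i += LINE) sum += a[i];
        printf("stride %d: %.3f s for %d accesses\n", LINE,
               (double)(clock() - t) / CLOCKS_PER_SEC, N / LINE);

        free(a);
        return 0;
    }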
7. Memory Hierarchy
Higher = small, fast, more $, lower latency
Lower = large, slow, less $, higher latency
Registers: 1-cycle latency
L1 (D$, I$ separate): 2-cycle latency
L2 (D$, I$ unified): 7-cycle latency
RAM: 100-cycle latency
Disk: 40,000,000-cycle latency
Network: 200,000,000+ cycle latency
[Figure: pyramid of levels; lines are loaded up the hierarchy on access and evicted down]
8. Cache Jargon
Cache initially cold
Accessing data initially misses
Fetch from lower level in hierarchy
Bring line into cache (populate cache)
Next access: hit
Once cache holds most-frequently used
data: “warmed up”
Context switch implications?
9. Cache Details
Ideal cache would be fully associative
That is, LRU (least-recently used) queue
Generally too expensive
Instead, partition memory addresses and
put into separate bins divided into ways
1-way (“direct-mapped”) = 1 entry per bin
2-way = 2 entries per bin
4-way = 4 entries per bin, etc.
10. Associativity Example
Hash memory addresses to different indices in the cache
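In hardware the “hash” is simply bit selection on the address. A sketch of the mapping, assuming 64-byte lines and 64 sets (both numbers are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64   /* assumed line size */
    #define NUM_SETS   64   /* assumed number of bins (sets) */

    /* Drop the offset-within-line bits, then take the line number modulo
       the number of sets: hash(addr) = (addr / LINE_BYTES) % NUM_SETS.
       With powers of two this is just a bit-field extract. */
    unsigned set_index(uintptr_t addr) {
        return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
    }

    int main(void) {
        int x;
        printf("&x maps to set %u\n", set_index((uintptr_t)&x));
        return 0;
    }

An N-way cache then searches only the N entries of that one set, which is why addresses whose line numbers agree modulo NUM_SETS compete for the same N slots.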
11. Miss Classification
First access = compulsory miss
Unavoidable without prefetching
Too many items mapped to same bin = conflict miss
Avoidable if we had higher associativity
No space in cache = capacity miss
Avoidable if cache were larger
Invalidated = coherence miss
Avoidable if cache were unshared
12. Exercise
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
Trace: 3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
How many misses?
# compulsory misses?
# conflict misses?
# capacity misses?
13. Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses?
# capacity misses?
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
14. Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses? 2
# capacity misses?
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
15. Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses? 2
# capacity misses? 0
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
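The solution can be checked mechanically. Below is a C sketch of the simulation; it assumes the slide's setup means 4 bins (hash(x) = x % 4) of 2 ways each with LRU replacement, and it classifies each non-compulsory miss as capacity if a fully associative LRU cache of the same total size (8 entries) would also miss, i.e. if at least 8 distinct values were touched since the last access, and as conflict otherwise. It prints compulsory=10 conflict=2 capacity=0.

    #include <stdio.h>
    #include <string.h>

    #define SETS 4          /* bins: hash(x) = x % 4 */
    #define WAYS 2          /* 2-way associative    */
    #define CAP  (SETS * WAYS)

    int trace[] = {3, 7, 11, 2, 3, 7, 7, 9, 9, 6, 13, 7, 2, 5, 8, 10};
    int n = sizeof trace / sizeof *trace;

    /* Number of distinct values since the previous access to x,
       or -1 if x has never been seen (compulsory miss). */
    int reuse_distance(int t, int x) {
        int seen[64], nseen = 0;
        for (int i = t - 1; i >= 0; i--) {
            if (trace[i] == x) return nseen;
            int dup = 0;
            for (int j = 0; j < nseen; j++) if (seen[j] == trace[i]) dup = 1;
            if (!dup) seen[nseen++] = trace[i];
        }
        return -1;
    }

    int main(void) {
        int set[SETS][WAYS];       /* each set ordered MRU..LRU */
        int fill[SETS] = {0};
        int comp = 0, conf = 0, cap = 0;

        for (int t = 0; t < n; t++) {
            int x = trace[t], s = x % SETS, hit = 0;
            for (int w = 0; w < fill[s]; w++) {
                if (set[s][w] == x) {                 /* hit: move to MRU slot */
                    memmove(&set[s][1], &set[s][0], w * sizeof(int));
                    set[s][0] = x;
                    hit = 1;
                    break;
                }
            }
            if (!hit) {                               /* miss: classify, insert */
                int d = reuse_distance(t, x);
                if (d < 0) comp++;
                else if (d >= CAP) cap++;  /* fully assoc. cache would miss too */
                else conf++;
                if (fill[s] < WAYS) fill[s]++;
                memmove(&set[s][1], &set[s][0], (fill[s] - 1) * sizeof(int));
                set[s][0] = x;             /* insert at MRU, dropping the LRU */
            }
        }
        printf("compulsory=%d conflict=%d capacity=%d\n", comp, conf, cap);
        return 0;
    }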
16. Locality
Locality = re-use of recently-used items
Temporal locality: re-use in time
Spatial locality: use of nearby items
In same cache line, same page (4K chunk)
Intuitively – greater locality = fewer misses
# misses depends on cache layout, # of levels,
associativity…
Machine-specific
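Spatial locality is easy to demonstrate in C: a 2-D array is stored row by row, so a row-major traversal touches consecutive addresses (usually the same cache line), while a column-major traversal jumps a whole row between accesses. A sketch; the array size is arbitrary:

    #include <stdio.h>
    #include <time.h>

    #define R 4096
    #define C 4096

    static int a[R][C];   /* rows are contiguous in memory */

    int main(void) {
        volatile long sum = 0;
        clock_t t;

        for (int i = 0; i < R; i++)       /* touch pages once up front */
            for (int j = 0; j < C; j++)
                a[i][j] = i + j;

        t = clock();   /* row-major: consecutive addresses, strong locality */
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++)
                sum += a[i][j];
        printf("row-major:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();   /* column-major: stride of C ints, new line every access */
        for (int j = 0; j < C; j++)
            for (int i = 0; i < R; i++)
                sum += a[i][j];
        printf("column-major: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        return 0;
    }

Same arithmetic, same number of accesses; only the order, and hence the locality, differs.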
17. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses
Trace: 3 7 7 2 3 7
[Figure: LRU stack so far (top to bottom): 7, 3; histogram x-axis: stack distance 1-6]
19. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses
Trace: 3 7 7 2 3 7
[Figure: LRU stack so far (top to bottom): 2, 7, 3; histogram x-axis: stack distance 1-6]
21. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses
Trace: 3 7 7 2 3 7
[Figure: LRU stack so far (top to bottom): 3, 2, 7; histogram x-axis: stack distance 1-6]
23. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
[Figure: cumulative hit counts over cache sizes 1-6: 1 1 3 3 3 3]
24. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
Normalize
[Figure: normalized hit curve over cache sizes 1-6 (y-axis 0-100%): .3 .3 1 1 1 1]
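A C sketch of the whole procedure: keep an LRU stack, record each re-reference's stack distance in a histogram, then accumulate. For the trace 3 7 7 2 3 7 it prints cumulative hits 1 1 3 3 3 3 and the normalized curve 0.33 0.33 1.00 1.00 1.00 1.00, matching the slides.

    #include <stdio.h>

    #define MAXD 64

    int main(void) {
        int trace[] = {3, 7, 7, 2, 3, 7};
        int n = sizeof trace / sizeof *trace;
        int stack[MAXD], depth = 0;   /* LRU stack: stack[0] is MRU */
        int hist[MAXD + 1] = {0};     /* hist[d] = re-references at distance d */
        int rerefs = 0;

        for (int t = 0; t < n; t++) {
            int x = trace[t], pos = -1;
            for (int i = 0; i < depth; i++)
                if (stack[i] == x) { pos = i; break; }
            if (pos >= 0) {           /* re-reference at stack distance pos+1 */
                hist[pos + 1]++;
                rerefs++;
                for (int i = pos; i > 0; i--) stack[i] = stack[i - 1];
            } else {                  /* compulsory miss: ignored, just push */
                for (int i = depth; i > 0; i--) stack[i] = stack[i - 1];
                depth++;
            }
            stack[0] = x;             /* x is now most recently used */
        }

        /* An LRU cache of size k catches re-references of distance <= k. */
        int hits = 0;
        for (int k = 1; k <= n; k++) {
            hits += hist[k];
            printf("size %d: %d hits (%.2f)\n", k, hits, (double)hits / rerefs);
        }
        return 0;
    }

Swapping in the exercise trace on the next slide reproduces the cumulative counts 1 2 2 2 3 3 4 5 6.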
25. Hit Curve Exercise
Derive hit curve for following trace:
3 5 4 2 8 3 6 9 9 6 13 7 2 5 8 10
27. Hit Curve Exercise
Derive hit curve for following trace:
Cumulative hits over cache sizes 1-9: 1 2 2 2 3 3 4 5 6
Trace: 3 5 4 2 8 3 6 9 9 6 13 7 2 5 8 10
28. Hit Curve Exercise
Derive hit curve for following trace:
[Figure: hit curve for the trace; cumulative hits 1 2 2 2 3 3 4 5 6 over cache sizes 1-9, normalized to 0-100%]
29. Important CPU Internals
Issues that affect performance
Pipelining
Branches & prediction
System calls (kernel crossings)
30. Scalar architecture + memory…
Straight-up sequential execution
Fetch instruction
Decode it
Execute it
Problem: instruction or data miss in cache
Result – stall: everything stops
How long to wait for a miss all the way to RAM? (~100 cycles, per the memory hierarchy)
31. Superscalar architectures
Out-of-order processors
Pipeline of instructions in flight
Instead of stalling on load, guess!
Branch prediction
Value prediction
Predictors based on history, location in program
Speculatively execute instructions
Actual results checked asynchronously
If mispredicted, squash instructions
Accurate prediction = massive speedup
Hides latency of memory hierarchy
32. Pipelining and Branches
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.
[Figure: five-stage pipeline (instruction fetch, instruction decode, execute, memory access, write back) with stages left idle behind an unresolved branch]
33. Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
[Figure: the same five-stage pipeline kept full by speculatively executing instructions past the predicted branch]
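The payoff of prediction is visible from plain C. In the classic demonstration below, the data-dependent branch is taken the same number of times either way, but sorting the data makes its outcome predictable, so the history-based predictor stops mispredicting and the second pass typically runs several times faster. Compile with modest optimization (e.g. -O1) so the branch is not vectorized away.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static int cmp(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    static long sum_big(const unsigned char *a) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            if (a[i] >= 128)            /* the data-dependent branch */
                sum += a[i];
        return sum;
    }

    int main(void) {
        unsigned char *a = malloc(N);
        volatile long sum = 0;
        clock_t t;

        for (int i = 0; i < N; i++) a[i] = rand() & 0xff;

        t = clock();                    /* random data: ~50% mispredictions */
        for (int rep = 0; rep < 100; rep++) sum += sum_big(a);
        printf("unsorted: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        qsort(a, N, 1, cmp);            /* now: all false, then all true */

        t = clock();                    /* same work, predictable branch */
        for (int rep = 0; rep < 100; rep++) sum += sum_big(a);
        printf("sorted:   %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        free(a);
        return 0;
    }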
34. Kernel Mode
Protects OS from users
kernel = English for nucleus
Think atom
Only privileged code executes in kernel
System call:
Enters kernel mode
Flushes pipeline, saves context (where we are in user land)
Executes code in kernel land
Returns to user mode, restoring context
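On Linux the crossing is easy to observe. Both calls below enter the kernel to perform the write; running the program under strace shows each kernel entry. (Linux-specific sketch; syscall() is the raw system-call interface, write() the usual libc wrapper around it.)

    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const char msg[] = "crossing into the kernel\n";

        /* The usual way: the libc wrapper. */
        write(STDOUT_FILENO, msg, sizeof msg - 1);

        /* The same crossing, made explicit: trap into kernel mode,
           run the kernel's write path, return to user mode. */
        syscall(SYS_write, STDOUT_FILENO, msg, sizeof msg - 1);

        return 0;
    }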
35. Timers & Interrupts
Need to respond to events periodically
Change executing processes
Quantum – time limit for process execution
Fairness – when timer goes off, interrupt
Current process stops
OS takes control through interrupt handler
Scheduler chooses next process
Interrupts also signal I/O events
Network packet arrival, disk read complete…
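A user-level analogy of the timer interrupt, using the POSIX interval timer: setitimer delivers SIGALRM every 100 ms (the "quantum" here), the signal handler plays the role of the OS interrupt handler, and the main loop is the interrupted process. Only an analogy, since the real scheduler runs in the kernel, but the shape is the same: periodic, asynchronous interruption of whatever is running.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile sig_atomic_t ticks = 0;

    /* Stands in for the OS interrupt handler: the running code is paused,
       control transfers here, then execution resumes where it left off. */
    static void on_timer(int sig) {
        (void)sig;
        ticks++;
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_timer;
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval quantum;
        memset(&quantum, 0, sizeof quantum);
        quantum.it_interval.tv_usec = 100000;   /* fire every 100 ms */
        quantum.it_value.tv_usec    = 100000;   /* first firing      */
        setitimer(ITIMER_REAL, &quantum, NULL);

        while (ticks < 10)   /* the "process", doing its work */
            pause();         /* sleep until the next interrupt */

        printf("interrupted %d times\n", (int)ticks);
        return 0;
    }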
36. To do
Read C/C++ notes for next week
First homework assigned next week
Language: C/C++
Will be due in 2 weeks
37. The End