Operating Systems - Architecture
1. Operating Systems
CMPSCI 377
Architecture
Emery Berger
University of Massachusetts Amherst
2. Architecture
Hardware Support for Applications & OS
Architecture basics & details
Focus on characteristics exposed to
application programmer / OS
3. The Memory Hierarchy
Registers
Caches
Associativity
Misses
Locality
4. Registers
Register = dedicated name for one word of memory managed by CPU
General-purpose: “AX”, “BX”, “CX” on x86
Special-purpose:
“SP” = stack pointer
“FP” = frame pointer
“PC” = program counter
[Figure: stack frame holding arguments (arg0, arg1, arg2); SP points to the top of the stack, FP to the base of the current frame]
5. Registers
Register = dedicated name for one word of memory managed by CPU
General-purpose: “AX”, “BX”, “CX” on x86
Special-purpose:
“SP” = stack pointer
“FP” = frame pointer
“PC” = program counter
Change processes: save current registers & load saved registers = context switch
[Figure: same stack diagram, with SP and FP marking the current frame]
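A context switch can be sketched in C. This is a minimal illustration, not kernel code: the regs_t layout below is hypothetical, and real switches are written in assembly, since compiled C itself needs registers to run.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical register file for one process (x86-like names). */
    typedef struct {
        uint64_t ax, bx, cx;   /* general-purpose */
        uint64_t sp, fp, pc;   /* stack pointer, frame pointer, program counter */
    } regs_t;

    /* Context switch: save the outgoing process's registers, then load
       the incoming process's saved registers into the CPU. */
    void context_switch(regs_t *cpu, regs_t *outgoing, regs_t *incoming) {
        memcpy(outgoing, cpu, sizeof(regs_t));  /* save current registers */
        memcpy(cpu, incoming, sizeof(regs_t));  /* load saved registers   */
    }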
6. Caches
Access to main memory: “expensive”
~100 cycles (slow, but relatively cheap per byte)
Caches: small, fast, expensive memory
Hold recently-accessed data (D$) or
instructions (I$)
Different sizes & locations
Level 1 (L1) – on-chip, smallish
Level 2 (L2) – on or next to chip, larger
Level 3 (L3) – pretty large, on bus
Manages lines of memory (32-128 bytes)
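Line granularity has a visible cost model: touching one byte loads the whole line, so consecutive accesses are nearly free while line-sized strides pay a miss almost every time. A C sketch, assuming 64-byte lines (the actual size is machine-specific):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16 MB array */
    #define LINE 64       /* assumed cache line size in bytes */

    int main(void) {
        unsigned char *a = malloc(N);
        volatile long sum = 0;
        clock_t t;

        for (long i = 0; i < N; i++) a[i] = (unsigned char)i; /* touch real pages */

        /* Stride 1: one miss loads a line; the next LINE-1 accesses hit. */
        t = clock();
        for (long i = 0; i < N; i++) sum += a[i];
        printf("stride 1:  %.3f s for %d accesses\n",
               (double)(clock() - t) / CLOCKS_PER_SEC, N);

        /* Stride LINE: every access lands on a fresh line, so nearly every
           access misses; 64x fewer accesses, but not 64x faster. */
        t = clock();
        for (long i = 0; i < N; i += LINE) sum += a[i];
        printf("stride %d: %.3f s for %d accesses\n", LINE,
               (double)(clock() - t) / CLOCKS_PER_SEC, N / LINE);

        free(a);
        return 0;
    }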
7. Memory Hierarchy
Higher = small, fast, more $, lower latency
Lower = large, slow, less $, higher latency
Registers: 1-cycle latency
L1 (D$, I$ separate): 2-cycle latency
L2 (D$, I$ unified): 7-cycle latency
RAM: 100-cycle latency
Disk: 40,000,000-cycle latency
Network: 200,000,000+ cycle latency
[Figure: pyramid of levels; lines are loaded up the hierarchy on access and evicted down]
8. Cache Jargon
Cache initially cold
Accessing data initially misses
Fetch from lower level in hierarchy
Bring line into cache (populate cache)
Next access: hit
Once cache holds most-frequently used
data: “warmed up”
Context switch implications?
9. Cache Details
Ideal cache would be fully associative
That is, LRU (least-recently used) queue
Generally too expensive
Instead, partition memory addresses and
put into separate bins divided into ways
1-way (“direct-mapped”) = 1 entry per bin
2-way = 2 entries per bin
4-way = 4 entries per bin, etc.
10. Associativity Example
Hash memory addresses to different indices in the cache
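In hardware the “hash” is simply bit selection on the address. A sketch of the mapping, assuming 64-byte lines and 64 sets (both numbers are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES 64   /* assumed line size */
    #define NUM_SETS   64   /* assumed number of bins (sets) */

    /* Drop the offset-within-line bits, then take the line number modulo
       the number of sets: hash(addr) = (addr / LINE_BYTES) % NUM_SETS.
       With powers of two this is just a bit-field extract. */
    unsigned set_index(uintptr_t addr) {
        return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
    }

    int main(void) {
        int x;
        printf("&x maps to set %u\n", set_index((uintptr_t)&x));
        return 0;
    }

An N-way cache then searches only the N entries of that one set, which is why addresses whose line numbers agree modulo NUM_SETS compete for the same N slots.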
11. Miss Classification
First access = compulsory miss
Unavoidable without prefetching
Too many items mapped to same bin = conflict miss
Avoidable if we had higher associativity
No space in cache = capacity miss
Avoidable if cache were larger
Invalidated = coherence miss
Avoidable if cache were unshared
12. Exercise
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
Trace: 3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
How many misses?
# compulsory misses?
# conflict misses?
# capacity misses?
13. Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses?
# capacity misses?
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
14. Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses? 2
# capacity misses?
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
15. Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses? 2
# capacity misses? 0
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
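The solution can be checked mechanically. Below is a C sketch of the simulation; it assumes the slide's setup means 4 bins (hash(x) = x % 4) of 2 ways each with LRU replacement, and it classifies each non-compulsory miss as capacity if a fully associative LRU cache of the same total size (8 entries) would also miss, i.e. if at least 8 distinct values were touched since the last access, and as conflict otherwise. It prints compulsory=10 conflict=2 capacity=0.

    #include <stdio.h>
    #include <string.h>

    #define SETS 4          /* bins: hash(x) = x % 4 */
    #define WAYS 2          /* 2-way associative    */
    #define CAP  (SETS * WAYS)

    int trace[] = {3, 7, 11, 2, 3, 7, 7, 9, 9, 6, 13, 7, 2, 5, 8, 10};
    int n = sizeof trace / sizeof *trace;

    /* Number of distinct values since the previous access to x,
       or -1 if x has never been seen (compulsory miss). */
    int reuse_distance(int t, int x) {
        int seen[64], nseen = 0;
        for (int i = t - 1; i >= 0; i--) {
            if (trace[i] == x) return nseen;
            int dup = 0;
            for (int j = 0; j < nseen; j++) if (seen[j] == trace[i]) dup = 1;
            if (!dup) seen[nseen++] = trace[i];
        }
        return -1;
    }

    int main(void) {
        int set[SETS][WAYS];       /* each set ordered MRU..LRU */
        int fill[SETS] = {0};
        int comp = 0, conf = 0, cap = 0;

        for (int t = 0; t < n; t++) {
            int x = trace[t], s = x % SETS, hit = 0;
            for (int w = 0; w < fill[s]; w++) {
                if (set[s][w] == x) {                 /* hit: move to MRU slot */
                    memmove(&set[s][1], &set[s][0], w * sizeof(int));
                    set[s][0] = x;
                    hit = 1;
                    break;
                }
            }
            if (!hit) {                               /* miss: classify, insert */
                int d = reuse_distance(t, x);
                if (d < 0) comp++;
                else if (d >= CAP) cap++;  /* fully assoc. cache would miss too */
                else conf++;
                if (fill[s] < WAYS) fill[s]++;
                memmove(&set[s][1], &set[s][0], (fill[s] - 1) * sizeof(int));
                set[s][0] = x;             /* insert at MRU, dropping the LRU */
            }
        }
        printf("compulsory=%d conflict=%d capacity=%d\n", comp, conf, cap);
        return 0;
    }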
16. Locality
Locality = re-use of recently-used items
Temporal locality: re-use in time
Spatial locality: use of nearby items
In same cache line, same page (4K chunk)
Intuitively – greater locality = fewer misses
# misses depends on cache layout, # of levels,
associativity…
Machine-specific
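Spatial locality is easy to demonstrate in C: a 2-D array is stored row by row, so a row-major traversal touches consecutive addresses (usually the same cache line), while a column-major traversal jumps a whole row between accesses. A sketch; the array size is arbitrary:

    #include <stdio.h>
    #include <time.h>

    #define R 4096
    #define C 4096

    static int a[R][C];   /* rows are contiguous in memory */

    int main(void) {
        volatile long sum = 0;
        clock_t t;

        for (int i = 0; i < R; i++)       /* touch pages once up front */
            for (int j = 0; j < C; j++)
                a[i][j] = i + j;

        t = clock();   /* row-major: consecutive addresses, strong locality */
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++)
                sum += a[i][j];
        printf("row-major:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();   /* column-major: stride of C ints, new line every access */
        for (int j = 0; j < C; j++)
            for (int i = 0; i < R; i++)
                sum += a[i][j];
        printf("column-major: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        return 0;
    }

Same arithmetic, same number of accesses; only the order, and hence the locality, differs.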
17. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses
Trace: 3 7 7 2 3 7
[Figure: LRU stack so far (top to bottom): 7, 3; histogram x-axis: stack distance 1-6]
19. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses
Trace: 3 7 7 2 3 7
[Figure: LRU stack so far (top to bottom): 2, 7, 3; histogram x-axis: stack distance 1-6]
21. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses
Trace: 3 7 7 2 3 7
[Figure: LRU stack so far (top to bottom): 3, 2, 7; histogram x-axis: stack distance 1-6]
23. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
[Figure: cumulative hit counts over cache sizes 1-6: 1 1 3 3 3 3]
24. Quantifying Locality
Instead of counting misses,
compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
Normalize
[Figure: normalized hit curve over cache sizes 1-6 (y-axis 0-100%): .3 .3 1 1 1 1]
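A C sketch of the whole procedure: keep an LRU stack, record each re-reference's stack distance in a histogram, then accumulate. For the trace 3 7 7 2 3 7 it prints cumulative hits 1 1 3 3 3 3 and the normalized curve 0.33 0.33 1.00 1.00 1.00 1.00, matching the slides.

    #include <stdio.h>

    #define MAXD 64

    int main(void) {
        int trace[] = {3, 7, 7, 2, 3, 7};
        int n = sizeof trace / sizeof *trace;
        int stack[MAXD], depth = 0;   /* LRU stack: stack[0] is MRU */
        int hist[MAXD + 1] = {0};     /* hist[d] = re-references at distance d */
        int rerefs = 0;

        for (int t = 0; t < n; t++) {
            int x = trace[t], pos = -1;
            for (int i = 0; i < depth; i++)
                if (stack[i] == x) { pos = i; break; }
            if (pos >= 0) {           /* re-reference at stack distance pos+1 */
                hist[pos + 1]++;
                rerefs++;
                for (int i = pos; i > 0; i--) stack[i] = stack[i - 1];
            } else {                  /* compulsory miss: ignored, just push */
                for (int i = depth; i > 0; i--) stack[i] = stack[i - 1];
                depth++;
            }
            stack[0] = x;             /* x is now most recently used */
        }

        /* An LRU cache of size k catches re-references of distance <= k. */
        int hits = 0;
        for (int k = 1; k <= n; k++) {
            hits += hist[k];
            printf("size %d: %d hits (%.2f)\n", k, hits, (double)hits / rerefs);
        }
        return 0;
    }

Swapping in the exercise trace on the next slide reproduces the cumulative counts 1 2 2 2 3 3 4 5 6.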
25. Hit Curve Exercise
Derive hit curve for following trace:
3 5 4 2 8 3 6 9 9 6 13 7 2 5 8 10
27. Hit Curve Exercise
Derive hit curve for following trace:
Cumulative hits over cache sizes 1-9: 1 2 2 2 3 3 4 5 6
Trace: 3 5 4 2 8 3 6 9 9 6 13 7 2 5 8 10
28. Hit Curve Exercise
Derive hit curve for following trace:
[Figure: hit curve for the trace; cumulative hits 1 2 2 2 3 3 4 5 6 over cache sizes 1-9, normalized to 0-100%]
29. Important CPU Internals
Issues that affect performance
Pipelining
Branches & prediction
System calls (kernel crossings)
30. Scalar architecture + memory…
Straight-up sequential execution
Fetch instruction
Decode it
Execute it
Problem: instruction or data miss in cache
Result – stall: everything stops
How long to wait for a miss all the way to RAM? (~100 cycles, per the memory hierarchy)
31. Superscalar architectures
Out-of-order processors
Pipeline of instructions in flight
Instead of stalling on load, guess!
Branch prediction
Value prediction
Predictors based on history, location in program
Speculatively execute instructions
Actual results checked asynchronously
If mispredicted, squash instructions
Accurate prediction = massive speedup
Hides latency of memory hierarchy
32. Pipelining and Branches
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.
[Figure: five-stage pipeline (instruction fetch, instruction decode, execute, memory access, write back) with stages left idle behind an unresolved branch]
33. Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
[Figure: the same five-stage pipeline kept full by speculatively executing instructions past the predicted branch]
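The payoff of prediction is visible from plain C. In the classic demonstration below, the data-dependent branch is taken the same number of times either way, but sorting the data makes its outcome predictable, so the history-based predictor stops mispredicting and the second pass typically runs several times faster. Compile with modest optimization (e.g. -O1) so the branch is not vectorized away.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static int cmp(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    static long sum_big(const unsigned char *a) {
        long sum = 0;
        for (int i = 0; i < N; i++)
            if (a[i] >= 128)            /* the data-dependent branch */
                sum += a[i];
        return sum;
    }

    int main(void) {
        unsigned char *a = malloc(N);
        volatile long sum = 0;
        clock_t t;

        for (int i = 0; i < N; i++) a[i] = rand() & 0xff;

        t = clock();                    /* random data: ~50% mispredictions */
        for (int rep = 0; rep < 100; rep++) sum += sum_big(a);
        printf("unsorted: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        qsort(a, N, 1, cmp);            /* now: all false, then all true */

        t = clock();                    /* same work, predictable branch */
        for (int rep = 0; rep < 100; rep++) sum += sum_big(a);
        printf("sorted:   %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        free(a);
        return 0;
    }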
34. Kernel Mode
Protects OS from users
kernel = English for nucleus
Think atom
Only privileged code executes in kernel
System call:
Enters kernel mode
Flushes pipeline, saves context (where we are in user land)
Executes code in kernel land
Returns to user mode, restoring context
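On Linux the crossing is easy to observe. Both calls below enter the kernel to perform the write; running the program under strace shows each kernel entry. (Linux-specific sketch; syscall() is the raw system-call interface, write() the usual libc wrapper around it.)

    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const char msg[] = "crossing into the kernel\n";

        /* The usual way: the libc wrapper. */
        write(STDOUT_FILENO, msg, sizeof msg - 1);

        /* The same crossing, made explicit: trap into kernel mode,
           run the kernel's write path, return to user mode. */
        syscall(SYS_write, STDOUT_FILENO, msg, sizeof msg - 1);

        return 0;
    }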
35. Timers & Interrupts
Need to respond to events periodically
Change executing processes
Quantum – time limit for process execution
Fairness – when timer goes off, interrupt
Current process stops
OS takes control through interrupt handler
Scheduler chooses next process
Interrupts also signal I/O events
Network packet arrival, disk read complete…
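A user-level analogy of the timer interrupt, using the POSIX interval timer: setitimer delivers SIGALRM every 100 ms (the "quantum" here), the signal handler plays the role of the OS interrupt handler, and the main loop is the interrupted process. Only an analogy, since the real scheduler runs in the kernel, but the shape is the same: periodic, asynchronous interruption of whatever is running.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile sig_atomic_t ticks = 0;

    /* Stands in for the OS interrupt handler: the running code is paused,
       control transfers here, then execution resumes where it left off. */
    static void on_timer(int sig) {
        (void)sig;
        ticks++;
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_timer;
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval quantum;
        memset(&quantum, 0, sizeof quantum);
        quantum.it_interval.tv_usec = 100000;   /* fire every 100 ms */
        quantum.it_value.tv_usec    = 100000;   /* first firing      */
        setitimer(ITIMER_REAL, &quantum, NULL);

        while (ticks < 10)   /* the "process", doing its work */
            pause();         /* sleep until the next interrupt */

        printf("interrupted %d times\n", (int)ticks);
        return 0;
    }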
36. To do
Read C/C++ notes for next week
First homework assigned next week
Language: C/C++
Will be due in 2 weeks
37. The End