Improving Real-Time Performance on
Multicore Platforms Using MemGuard
University of Kansas
Dr. Heechul Yun
10/28/2013
A case-study presented at the Real-Time Linux Workshop (Oct, 2013)

  • Soon, more RT/embedded systems will use multicore processors as well.
  • In unicore systems, CPU time is the most important shared resource determining an application’s performance. In multicore systems, however, memory performance is also very important, because multiple cores can access memory concurrently and significantly affect each other’s performance.
  • Problem 1: coordinating memory slots with tasks → requires program modification (PREM). Problem 2: only 1 core can access memory at a time → does not fully utilize memory-level parallelism.
  • First, let me explain how the b/w regulator works.
  • Why do we want to regulate the request rates?
  • Problem: DRAM
  • Improving Real-Time Performance on Multicore Platforms using MemGuard

    1. Improving Real-Time Performance on Multicore Platforms Using MemGuard. University of Kansas, Dr. Heechul Yun, 10/28/2013
    2. Multicore: Server, Desktop, Mobile, RT/Embedded
    3. Challenges: Shared Resources. [Diagram: Unicore (tasks T1, T2 on one CPU with its memory hierarchy) vs. Multicore (tasks T1 to T8 on Core 1 to Core 4 with a shared memory hierarchy); label: Performance Impact]
    4. Case Study • HRT – Synthetic real-time video capture – P=20, D=13ms – Cache-insensitive • X-server – Scrolling text on a gnome-terminal • Hardware platform – Intel Xeon 3530 – 8MB shared L3 cache – 4GB DDR3 1333MHz DIMM (1ch) • CPU cores are isolated. [Diagram: a desktop PC (Intel Xeon 3530) with HRT on Core1 and Xsrv. on Core2, sharing L3 (8MB) and DRAM]
    5. HRT Time Distribution • solo: 99pct 10.2ms • w/ Xserver: 99pct 14.3ms • 28% deadline violations • Due to contention in DRAM
    6. Outline • Motivation • Background – DRAM basics – Worst-case memory performance – MemGuard [RTAS’13] • Improving Real-Time Performance with MemGuard
    7. Background: DRAM Organization • Have multiple banks • Different banks can be accessed in parallel. [Diagram: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
    8. Best-case: Fast • Peak = 10.6 GB/s – DDR3 1333MHz. [Diagram: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
    9. Best-case: Fast • Peak = 10.6 GB/s – DDR3 1333MHz • Out-of-order processors. [Diagram: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
    10. Most-cases: Mess • Performance = ?? (*) Intel® 64 and IA-32 Architectures Optimization Reference Manual. [Diagram: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
    11. Worst-case: Slow • 1bank b/w – Less than peak b/w – How much? (*) Intel® 64 and IA-32 Architectures Optimization Reference Manual. [Diagram: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
    12. Background: DRAM Operation • Stateful per-bank access time – Row miss: 19 cycles – Row hit: 9 cycles (*) PC6400-DDR2 with 5-5-5 (RAS-CAS-CL latency setting). [Diagram: Bank 1 with Row 1 to Row 5 and a row buffer; READ (Bank 1, Row 3, Col 7) shown with activate, read/write of Col7, and precharge]
    13. Real Worst-case • 1 bank & always row miss → ~1.2GB/s • Each core = ¼ x 1.2GB/s = 300MB/s ? (*) Intel® 64 and IA-32 Architectures Optimization Reference Manual. [Diagram: requests from Core 1 to Core 4 pass through L3 and the Memory Controller (MC) to the DRAM DIMM (Bank 1 to Bank 4); request order over time: Row 1, Row 2, Row 3, Row 4, Row 1, Row 2, …]
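The best-case and worst-case figures on slides 8 and 13 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, the 64-bit (8-byte) data bus per channel is an assumption about the DIMM, and the ~1.2GB/s worst-case value is taken from the slide rather than derived.

```python
def peak_bandwidth_gbs(transfers_per_sec: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one DRAM channel in GB/s (1 GB = 1e9 bytes).

    bus_bytes=8 assumes a 64-bit data bus (an assumption, not from the slides).
    """
    return transfers_per_sec * bus_bytes / 1e9


def per_core_share_mbs(total_gbs: float, cores: int) -> float:
    """Even split of a total bandwidth (GB/s) across cores, in MB/s."""
    return total_gbs * 1000 / cores


# DDR3-1333: about 1333 million transfers per second on one channel.
peak = peak_bandwidth_gbs(1333e6)

# Worst case quoted on the slide: 1 bank, always row miss.
worst_case_gbs = 1.2
share = per_core_share_mbs(worst_case_gbs, cores=4)

print(f"peak = {peak:.3f} GB/s, worst-case share per core = {share:.0f} MB/s")
```

Under these assumptions the peak comes out slightly above the 10.6 GB/s quoted on the slide, and the per-core share matches the 300MB/s figure.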
    14. Background: Memory Controller (MC) • Request queue(s) – Not fair (open-row first → re-ordering) – Unpredictable queuing delay. [Figure from Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk”, Fig 13.1]
    15. Challenges for Real-Time Systems • Multiple parallel resources (banks) • Stateful bank access latency • Queuing delay • Unpredictable memory performance
    16. MemGuard [RTAS’13] • Goal: guarantee minimum memory b/w for each core • How: b/w reservation + best effort sharing. [Diagram: MemGuard in the operating system, with a Reclaim Manager and one BW Regulator per core (0.6GB/s, 0.2GB/s, 0.2GB/s, 0.2GB/s); PMC on each of Core1 to Core4 of the multicore processor; Memory Controller; DRAM DIMM]
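The per-core reservations in the slide 16 diagram (0.6 + 0.2 + 0.2 + 0.2 GB/s) add up to the ~1.2GB/s worst-case figure from slide 13. A minimal sketch of that consistency check follows. It assumes, as an interpretation of the slides rather than a statement of MemGuard's actual code, that the sum of reservations must not exceed the guaranteed bandwidth; `admit` and `GUARANTEED_MBS` are hypothetical names.

```python
# Guaranteed (worst-case) DRAM bandwidth in MB/s, taken from the ~1.2GB/s
# figure on the slides. Integers avoid floating-point rounding in the sum.
GUARANTEED_MBS = 1200


def admit(reservations_mbs, guaranteed_mbs=GUARANTEED_MBS):
    """Return True if the per-core reservations fit in the guaranteed b/w.

    Assumption: reservations are only meaningful as guarantees when their
    sum stays within the worst-case bandwidth of the memory system.
    """
    if any(r < 0 for r in reservations_mbs):
        raise ValueError("reservations must be non-negative")
    return sum(reservations_mbs) <= guaranteed_mbs


slide_config = [600, 200, 200, 200]   # Core1..Core4, as in the diagram
over_committed = [900, 300, 200]      # hypothetical example that does not fit

print(admit(slide_config), admit(over_committed))
```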
    17. Reservation • Idea – Scheduler regulates per-core memory b/w using h/w counters – Period = 1 scheduler tick (e.g., 1ms). [Timeline from 0 to 2ms showing the budget and the core activity (computation, memory fetch), with the annotations “Schedule a RT idle task” and “Suspend the RT idle task”]
    18. Reservation
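The reservation idea on slide 17 (a per-core budget that is replenished every period and enforced with hardware counters) can be sketched as a small user-space simulation. This is not the MemGuard kernel module: the `Regulator` class and `budget_per_period` helper are hypothetical, and the 64-byte cache line used to convert bandwidth into a count of memory accesses is an assumption.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted memory access moves 64 bytes


def budget_per_period(bw_mbs, period_ms=1.0, line_bytes=CACHE_LINE_BYTES):
    """Number of memory accesses a core may issue in one period."""
    bytes_per_period = bw_mbs * 1e6 * period_ms / 1e3
    return int(bytes_per_period // line_bytes)


class Regulator:
    """Simulated per-core bandwidth regulator (illustrative only)."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.throttled = False

    def on_period_start(self):
        # New scheduler tick: replenish the budget and resume the core.
        self.used = 0
        self.throttled = False

    def on_accesses(self, n):
        """Account for n requested accesses; return how many were allowed."""
        if self.throttled:
            return 0
        allowed = min(n, self.budget - self.used)
        self.used += allowed
        if self.used >= self.budget:
            # Budget exhausted: the core stays idle until the next period.
            self.throttled = True
        return allowed


reg = Regulator(budget_per_period(300))   # 300MB/s reservation, 1ms period
first = reg.on_accesses(4000)             # fits in the budget
second = reg.on_accesses(1000)            # only part of it fits
third = reg.on_accesses(10)               # throttled until the next period
reg.on_period_start()
fourth = reg.on_accesses(10)              # allowed again
```

In the slides, throttling is done by scheduling an RT idle task on the core; the simulation only models the accounting.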
    19. Best-Effort Sharing • Spare Sharing [RTAS’13] • Proportional Sharing [Unpublished TR]. [Timeline, time(ms) from 0 to 2, for Core0 (900MB/s) and Core1 (300MB/s), showing guaranteed b/w and best-effort b/w, with the annotations “throttled” and “reschedule”]
    20. Case Study • HRT – Synthetic real-time video capture – P=20, D=13ms – Cache-insensitive • X-server – Scrolling text on a gnome-terminal • Hardware platform – Intel Xeon 3530 – 8MB shared cache – 4GB DDR3 1333MHz DIMM. [Diagram: a desktop PC (Intel Xeon 3530) with HRT on Core1 and Xsrv. on Core2, sharing L3 (8MB) and DRAM]
    21. w/o MemGuard • HRT (solo): HRT’s 99pct 10.2ms • HRT (w/ Xserver): HRT’s 99pct 14.3ms, X’s CPU util 78%
    22. MemGuard, reserve only (HRT=900MB/s, X=300MB/s) • HRT (solo): HRT’s 99pct 10.7ms • HRT (w/ Xserver): HRT’s 99pct 11.2ms, X’s CPU util 4%
    23. MemGuard, reserve (HRT=900MB/s, X=300MB/s) + best-effort sharing • HRT (solo): HRT’s 99pct 10.7ms • HRT (w/ Xserver): HRT’s 99pct 10.7ms, X’s CPU util 48%
    24. MemGuard, reserve (HRT=600MB/s, X=600MB/s) + best-effort sharing • HRT (solo): HRT’s 99pct 10.9ms • HRT (w/ Xserver): HRT’s 99pct 12.1ms, X’s CPU util 61%
    25. Real-Time Performance Improvement (HRT, X-server) • Using MemGuard, we can achieve – No deadline miss for HRT – Good X-server performance
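The case-study numbers from slides 21 to 24 can be lined up against the 13ms deadline from slide 20. The sketch below only restates the slide values; the dictionary layout and the `meets_deadline_at_p99` helper are made up for illustration, and a 99th percentile below the deadline does not by itself prove that no period missed it.

```python
DEADLINE_MS = 13.0  # D=13ms from the case-study slide

# Configuration -> (HRT 99pct in ms with X-server running, X's CPU util in %)
results = {
    "w/o MemGuard": (14.3, 78),
    "reserve only (900/300)": (11.2, 4),
    "reserve + share (900/300)": (10.7, 48),
    "reserve + share (600/600)": (12.1, 61),
}


def meets_deadline_at_p99(p99_ms, deadline_ms=DEADLINE_MS):
    """True if the 99th-percentile execution time is within the deadline."""
    return p99_ms <= deadline_ms


summary = {name: meets_deadline_at_p99(p99) for name, (p99, _) in results.items()}

for name, (p99, util) in results.items():
    status = "ok" if summary[name] else "over deadline"
    print(f"{name:28s} 99pct={p99:5.1f}ms  X util={util:3d}%  {status}")
```

Read this way, the table shows the trade-off the slides describe: reservation alone protects HRT but starves the X-server, while adding best-effort sharing recovers much of the X-server's CPU utilization.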
    26. Conclusion • Unpredictable memory performance – multiple resources (banks), per-bank state, unpredictable queueing delay • MemGuard – Guarantee minimum memory bandwidth for each core – b/w reservation (guaranteed part) + best-effort sharing • Case-study – On Intel Xeon multicore platform, using HRT + X-server – MemGuard can improve real-time performance efficiently • Limitations and Future Work – Coarse grain (an OS tick) enforcement – Small guaranteed b/w → DRAM bank partitioning (submitted to RTAS’14) • https://github.com/heechul/memguard
    27. Thank you.
    28. Evaluation on Intel Core2 • T1: Synthetic video capture task (HRT) – Period=20ms (50Hz) – Deadline=14ms – Metrics: ACET, WCET, stdev, deadline miss ratio (out of 1000 periods) • T2: Xserver, update screen (SRT) – Metric: CPU utilization • Higher CPU utilization → faster screen update • Platform – Intel Core2Quad 8400, 2MB L2 cache x 2, tunable H/W prefetchers – PC6400 DDR2 DRAM DIMM x 1 • Three platform configurations – Exp1: Private L2, Prefetch=off – Exp2: Private L2, Prefetch=on – Exp3: Shared L2, Prefetch=on. [Diagram: Intel Core2Quad based PC with Core0 to Core3, two L2 caches (pref.), and DRAM]
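The metrics listed for T1 on slide 28 (ACET, WCET, stdev, deadline miss ratio) could be computed from a log of per-period execution times roughly as follows. This is a sketch, not the authors' tooling: `summarize` is a hypothetical helper, "WCET" here means the largest observed value, the population standard deviation is one possible choice, and the sample data is invented.

```python
import statistics


def summarize(exec_times_ms, deadline_ms):
    """Summarize per-period execution times of a periodic task."""
    n = len(exec_times_ms)
    if n == 0:
        raise ValueError("need at least one sample")
    misses = sum(1 for t in exec_times_ms if t > deadline_ms)
    return {
        "acet": statistics.fmean(exec_times_ms),     # average-case exec. time
        "wcet": max(exec_times_ms),                  # observed worst case
        "stdev": statistics.pstdev(exec_times_ms),   # population std. dev.
        "miss_ratio": misses / n,                    # fraction over deadline
    }


# Invented samples for illustration; the slides use 1000 periods.
samples = [10.0, 11.0, 12.0, 15.0]
stats = summarize(samples, deadline_ms=14.0)
print(stats)
```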
    29. Experiment 1 (Private L2, Prefetch=off): Performance guarantee. [Bar chart: T1’s exec. time (ms), axis 0 to 18, solo vs. corun, deadline marked; panels: Original, MemGuard (Reserve only), MemGuard (reclaim + share); T2 labels in extraction order: 92%, 38%, 78%; MemGuard reservations: 550M/s for Core1 and 550M/s for Core2]
    30. Experiment 1 (Private L2, Prefetch=off): Performance guarantee. [Same bar chart with ACET and WCET marked and a “30% WCET” annotation; T2 labels in extraction order: 92%, 38%, 78%; MemGuard reservations: 550M/s for Core1 and 550M/s for Core2]
    31. Experiment 1 (Private L2, Prefetch=off). [Same bar chart, deadline marked; T2 labels in extraction order: 92%, 38%, 78%; MemGuard reservations: 550M/s for Core1 and 550M/s for Core2]
    32. Experiment 1 (Private L2, Prefetch=off). [Same bar chart, deadline marked; T2 labels in extraction order: 92%, 38%, 78%; MemGuard reservations: 550M/s for Core1 and 550M/s for Core2]
    33. Experiment 1 (Private L2, Prefetch=off). [Same bar chart with a “Performance target” line; T2 labels in extraction order: 92%, 38%, 78%; MemGuard reservations: 550M/s for Core1 and 550M/s for Core2]
    34. Experiment 2: Prefetcher (Private L2, Prefetch=ON): Not enough reserv., more slowdown, deadline violation. [Bar chart: T1’s exec. time (ms), axis 0 to 24, solo vs. corun, deadline marked, “60%” annotation; panels: Original, MemGuard (Reserve only), MemGuard (reclaim + share); T2 labels in extraction order: 94%, 33%, 82%; MemGuard reservations: 550M/s for Core1 and 550M/s for Core2]
    35. Experiment 2-2 (Private L2, Prefetch=ON): Enough reserv., no deadline violation. [Bar chart: T1’s exec. time (ms), axis 0 to 18, solo vs. corun, “60%” annotation; panels: Original, MemGuard (Reserve only), MemGuard (reclaim + share); T2 labels in extraction order: 94%, 14%, 69%; MemGuard reservations: 900M/s and 200M/s]
    36. Experiment 3: Shared Cache (Shared L2, Prefetch=ON): Even more slowdown, minimum reserv., no deadline violation. [Bar chart: T1’s exec. time (ms), axis 0 to 24, solo vs. corun, “108%” annotation; panels: Original, MemGuard (Reserve only), MemGuard (reclaim + share); T2 labels in extraction order: 11%, 63%, 92%; MemGuard reservations: 900M/s and 200M/s]
