Virtualization for Emerging Memory Devices
1. Virtualization for Emerging Memory Devices
Takahiro Hirofuchi, Ph.D.
Senior Research Scientist, Deputy Director of RWBC-OIL,
National Institute of Advanced Industrial Science and Technology (AIST)
Invited Presentation, Cool Chips 23, Apr 16, 2020
An updated version of the slides will be uploaded to my homepage.
https://takahiro-hirofuchi.github.io/
2. DRAM is not scalable anymore …
• Energy consuming
– Requires refresh energy
– Accounts for more than 30% of the energy consumption of a data center
• Not scalable anymore
– Serious energy dissipation
• Memory capacity wall
• Memory bandwidth wall
[Figure: a DRAM cell stores data as charge in a capacitor]
3. Emerging memory devices are promising.
[Figure: device structures and resistance states. Phase Change Memory (PCM): crystalline = low resistance, amorphous = high resistance. Magneto-resistive Random Access Memory (MRAM): a pinned layer and a free layer, with low- and high-resistance states. Resistive Random Access Memory (ReRAM): filament forming = low resistance, otherwise high resistance.]
• Energy efficient
– No refresh energy
– Non-volatile
• Scalable in theory
– Higher density will be possible
• Potential to drastically extend the capacity of main memory
• Potential for 3D-stack integration, drastically increasing bandwidths
4. Intel Optane Data Center Persistent Memory (DCPMM)
• Released in April 2019
• First byte-addressable NVM based on a next-generation memory device
– 3D-stacked resistive memory cells
– (Possibly) PCM-based
• Connected to the memory bus of the CPU via the DIMM interface
• 128-512 GB per memory module
– A DRAM module is typically 8-32 GB.
[Figure: Optane DCPMM and DRAM modules both attached to the CPU's memory controller]
5. Optane DCPMMʼs latency is higher than DRAMʼs.
For read-only access, DCPMMʼs latency is 4 times higher.
For read/write access, DCPMMʼs latency is 4.1 times higher.
[1] Takahiro Hirofuchi, Ryousei Takano, “A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module,” IEICE Transactions on Information and Systems, Vol.E103-D, No.5, May 2020 (arXiv preprint available).
6. Optane DCPMMʼs bandwidth is limited.
For read-only access, DCPMMʼs bandwidth is 37% of DRAMʼs.
For read/write access, DCPMMʼs bandwidth is 8% of DRAMʼs.
7. Huge Performance Gap in Main Memory
• Optane DCPMM
– Capacity is 10 times larger than DRAMʼs.
– Latency is 4 times higher.
– Bandwidth is about 10% of DRAMʼs.
• PCM and ReRAM
– Slow write operations
– High write energy
• STT-MRAM (around 2016)
– High write energy
[Figure: a physical address space where DRAM and an emerging memory device are mapped at different offsets]
Advantage in capacity, but a huge difference in some performance metrics.
• Main memory now becomes hybrid.
• New system software studies are necessary.
8. RAMinate: Hypervisor-based Virtualization for Hybrid Main Memory
• Virtually create a unified memory combining DRAM and byte-addressable NVM (e.g., Optane DCPMM)
• Make it appear as one memory device to the OS and applications
• Everything is done at the hypervisor layer
• Support any OSes and applications without any modification to them
• Maximize the performance of a VM even with a small mixed ratio of DRAM
• Hot memory pages are automatically relocated to DRAM (i.e., fast memory)
• Cold memory pages are automatically relocated to NVM (i.e., slow memory)
• https://github.com/takahiro-hirofuchi/raminate
• ACM SoCC 2016 Best Paper Award!
[Figure: RAMinate (extended Qemu/KVM) builds the main memory of a VM, which appears like DRAM, from DRAM and byte-addressable NVM (Optane, MRAM, PCM), in three steps: 1. create a unified RAM, 2. monitor memory access, 3. optimize page mapping]
9. Overview of RAMinate (1)
• Allocate the RAM of a VM from the DRAM and NVM pools
• The guest OS sees a uniform RAM composed of one memory device
[Figure: RAMinate (extended Qemu/KVM) assembles the 16 GB RAM of a VM from 1 GB of the DRAM pool and 15 GB of the NVM pool; the guest sees a single 16 GB DRAM]
10. Overview of RAMinate (2)
1a. Detect read/write-intensive guest physical pages on NVM
11. Overview of RAMinate (3)
1b. Detect guest physical pages on DRAM that have few read/write operations
12. Overview of RAMinate (4)
2. Swap read/write-intensive NVM pages with DRAM pages having few read/write requests
This diverts read/write traffic to DRAM (the faster memory) behind the guest OS.
13. A Use-case of RAMinate for Optane DCPMM
A VM is set up with 4 GB RAM at a mixed ratio of 1% DRAM and 99% DCPMM.
[Figure: read throughput (MB/s), write throughput (MB/s), and the number of page relocations over elapsed time (s) during a kernel build]
1. Just after the kernel build started, most memory traffic was from/to the DCPMM side. But after RAMinate optimized page locations, the memory traffic of DCPMM was reduced to 50%.
2. RAMinate detected hot memory pages and moved them to the DRAM side. It also moved cold memory pages to the Optane side.
2a. Initially, it optimized page locations intensively.
2b. It continuously updated memory mappings in response to changes in the hot memory pages.
The time required for the kernel compile was reduced by 10% in comparison to the no-optimization case.
14. How is RAMinate implemented?
• Everything is done in the software layer
– The only requirements:
• A memory device mapped into the physical address space
• The access/dirty bits of the page table entries for VMs (e.g., the Extended Page Table of Intel CPUs)
– Any (rich) optimization algorithm is possible
• No need for a special memory controller
– In comparison to the Memory Mode of Optane DCPMM
• Memory Mode: DRAM as an L4 cache, transparent to the OS
• Can customize the mixed ratio of DRAM and NVM per VM
15. Design Details (Component 1)
RAMinate periodically scans the Extended Page Table (EPT) entries of a VM and finds dirty/accessed guest pages (see the sketch below).
The EPT maintains the mapping between guest frame numbers and physical frame numbers. When a guest page is written, the CPU sets the dirty bit of the corresponding entry.
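As a rough, hypothetical illustration of this scan, the C sketch below models a page table as a flat array with accessed/dirty flag bits (bits 8 and 9, the positions Intel's EPT uses when A/D bits are enabled). The `ept_entry` layout and `scan_ept` function are simplifications invented for this example; the real scan runs inside the hypervisor against live EPT structures.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified model of an EPT entry: the physical frame
 * number plus flag bits the CPU sets on access (bit 8) and write (bit 9).
 * Real entries live inside the hypervisor, not in user space. */
struct ept_entry {
    uint64_t pfn;   /* physical frame number backing this guest page */
    uint64_t flags; /* accessed/dirty bits set by the CPU */
};

#define EPT_ACCESSED (1ULL << 8)
#define EPT_DIRTY    (1ULL << 9)

/* One scan pass: record which guest pages were read or written since
 * the last pass, then clear the bits so the next pass sees fresh data. */
static void scan_ept(struct ept_entry *ept, size_t n,
                     unsigned *access_count, unsigned *write_count)
{
    for (size_t gfn = 0; gfn < n; gfn++) {
        if (ept[gfn].flags & EPT_ACCESSED)
            access_count[gfn]++;
        if (ept[gfn].flags & EPT_DIRTY)
            write_count[gfn]++;
        ept[gfn].flags &= ~(EPT_ACCESSED | EPT_DIRTY);
    }
}
```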
16. Design Details (Component 2)
The optimization algorithm, Corked Multi Queue (CMQ), determines which guest pages should be migrated, considering memory access history. The queue level relates to the number of past page accesses; NVM pages in higher-level queues are candidates for page migration (a generic sketch follows below).
[Figure: multi-queue structures Q0-Q3, with pages moving between levels]
• Promote a page when the number of past detected accesses of the page exceeds a threshold value.
• Demote a page when the elapsed time since its last access exceeds a threshold value.
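The following C sketch shows a generic multi-queue promote/demote mechanism of the kind described above. It is not RAMinate's actual CMQ implementation; the "corking" behavior that suppresses unnecessary swaps is left out, and all threshold values are invented for illustration (see the SoCC 2016 paper for the real algorithm).

```c
#include <stdint.h>

#define NUM_LEVELS      4      /* Q0..Q3 as in the figure */
#define PROMOTE_THRESH  8      /* hypothetical access-count threshold */
#define DEMOTE_TIMEOUT  100    /* hypothetical idle time, in scan ticks */

struct page_state {
    unsigned level;       /* current queue level, 0 (cold) .. 3 (hot) */
    unsigned accesses;    /* accesses detected since the last promotion */
    uint64_t last_access; /* tick of the most recent detected access */
};

/* Called once per scan tick for each tracked page. `accessed` comes
 * from the EPT accessed/dirty bits collected by Component 1. */
static void mq_update(struct page_state *p, int accessed, uint64_t now)
{
    if (accessed) {
        p->last_access = now;
        /* Promote: enough accesses accumulated at this level. */
        if (++p->accesses >= PROMOTE_THRESH && p->level < NUM_LEVELS - 1) {
            p->level++;
            p->accesses = 0;
        }
    } else if (now - p->last_access > DEMOTE_TIMEOUT && p->level > 0) {
        /* Demote: the page has been idle too long. */
        p->level--;
        p->last_access = now; /* restart the idle timer at the new level */
    }
}

/* An NVM page sitting in the top-level queue is a migration candidate. */
static int is_migration_candidate(const struct page_state *p, int on_nvm)
{
    return on_nvm && p->level >= NUM_LEVELS - 1;
}
```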
17. Simulation on optimization algorithms
• Compare CMQ, LRU, and no optimization
• Use trace data collected during SPEC CPU 2006
[Figure: DRAM hit ratio and the number of swapped pages for each algorithm]
CMQ drastically reduces the number of necessary page swaps.
18. Design Details (Component 3)
RAMinate dynamically updates the page mapping between guest and physical pages without disrupting the guest operating system.
19. Page Swap in Detail
• Temporarily stop the VM during optimization
1. Stop the VM
2. Conduct page swap operations (repeated for 1000 pairs at most; one pair is sketched below)
a. Swap the PFNs of the 2 EPT entries
b. Swap the data of the memory pages
3. Restart the VM
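A minimal sketch of one swap pair, reusing the simplified `ept_entry` model from the Component 1 sketch; `PAGE_SIZE`, the bounce buffer, and the function name are illustrative, and the real code operates on EPT entries inside KVM while the VM is paused.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct ept_entry { uint64_t pfn; uint64_t flags; }; /* as in the scan sketch */

/* Swap one hot NVM page with one cold DRAM page while the VM is stopped:
 * exchange the physical frames the two guest pages map to, then exchange
 * the page contents so the guest observes no change. */
static void swap_pages(struct ept_entry *hot_nvm, struct ept_entry *cold_dram,
                       void *hot_data, void *cold_data)
{
    /* a. Swap the PFNs of the two EPT entries. */
    uint64_t tmp_pfn = hot_nvm->pfn;
    hot_nvm->pfn = cold_dram->pfn;
    cold_dram->pfn = tmp_pfn;

    /* b. Swap the data of the two memory pages via a bounce buffer. */
    static char tmp[PAGE_SIZE];
    memcpy(tmp, hot_data, PAGE_SIZE);
    memcpy(hot_data, cold_data, PAGE_SIZE);
    memcpy(cold_data, tmp, PAGE_SIZE);
}
```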
20. Page Swap Overhead
• Downtime is approximately 30 ms for 1000 swap operations.
• Short enough to be acceptable
– Page mapping optimization is performed every 5 seconds in the default setting.
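A quick sanity check on these two numbers (simple arithmetic, not stated on the slide):

```latex
\frac{30\,\text{ms}}{1000\ \text{pairs}} = 30\,\mu\text{s per swapped pair},
\qquad
\frac{30\,\text{ms}}{5\,\text{s}} = 0.006 \approx 0.6\%\ \text{worst-case downtime}
```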
21. A Use-case of RAMinate for Future STT-MRAM
• Assumption as of 2016: STT-MRAM is as fast as DRAM, but its write energy is 10^2 times larger than that of DRAM.
• A hybrid memory with 10% DRAM outperformed a DRAM-only memory in terms of energy consumption.
[Figure: energy consumption of main memory]
ACM SoCC 2016, Hirofuchi et al.
22. MESMERIC: A Software-based Emulator for Non-volatile Main Memory
• Virtually create a slow NVM for a target program by adjusting the execution speed of the program running on a DRAM-equipped machine (the core idea is sketched below)
• Fast and accurate
• Aware of asymmetric read/write latencies (i.e., an advantage over HPʼs Quartz)
[Figure: a performance model analyzes cache misses in real time using the PMCs of the LLC and memory controllers, and a task scheduler dynamically adjusts the CPU time assigned to the target program so that DRAM mimics NVM]
A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies, IEICE Trans., Dec. 2019
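A back-of-the-envelope sketch of the core idea, not MESMERIC's actual model: if the PMCs report how many LLC misses were serviced as reads and as writes during an epoch, the extra time a real NVM would have cost can be estimated from the per-device latency gaps, and the scheduler stalls the target program by that amount. The linear model and all names below are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical device latencies in nanoseconds (illustrative values;
 * the NVM is assumed slower than DRAM for both operations). */
struct mem_latency {
    uint64_t read_ns;
    uint64_t write_ns;
};

/* Estimate the delay to inject for one monitoring epoch: each LLC miss
 * that went to memory would have taken longer on NVM than on DRAM, and
 * read and write latencies differ (the asymmetry noted above). */
static uint64_t epoch_delay_ns(uint64_t read_misses, uint64_t write_misses,
                               struct mem_latency dram, struct mem_latency nvm)
{
    uint64_t read_gap  = nvm.read_ns  - dram.read_ns;
    uint64_t write_gap = nvm.write_ns - dram.write_ns;
    return read_misses * read_gap + write_misses * write_gap;
}
```

In use, an emulator of this kind would read the two miss counts from hardware performance counters each epoch and deschedule the target program for `epoch_delay_ns`.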
23. METICULOUS: An FPGA-based Emulator for Hybrid Main Memory
• Emulate hybrid main memory systems at the hardware level
– Set the latency, bandwidth, and bit-flip rate for each physical address range (see the configuration sketch below)
• Enable detailed performance evaluation of new system software mechanisms
[Figure: a design-space chart of speed (fast/real-time vs. slow) versus reproducible behaviors (coarse-grained vs. fine-grained): cycle-accurate CPU simulators (e.g., gem5) are slow but fine-grained, the software-based emulator (MESMERIC) is fast but coarse-grained, and the FPGA-based emulator (METICULOUS) is fast and fine-grained]
[Figure: the CPU (cores and LLC) accesses DRAM through an FPGA that inserts latency and limits bandwidth]
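To make the per-range idea concrete, here is a hypothetical configuration table of the kind such an emulator might consume; the struct, field names, and values are all invented for illustration, not METICULOUS's actual interface.

```c
#include <stdint.h>

/* Hypothetical per-range emulation parameters, mirroring the bullet
 * above: latency, bandwidth, and bit-flip rate per address range. */
struct emu_range {
    uint64_t base;      /* start physical address of the range */
    uint64_t size;      /* length of the range in bytes */
    uint32_t extra_ns;  /* latency inserted per access */
    uint32_t bw_mb_s;   /* bandwidth cap in MB/s */
    double   ber;       /* injected bit error ratio */
};

/* Example: the first range behaves like DRAM, the second like a slow,
 * slightly unreliable NVM (all values illustrative). */
static const struct emu_range config[] = {
    { 0x000000000ULL, 1ULL << 30,   0, 25600, 0.0  },
    { 0x040000000ULL, 4ULL << 30, 300,  2000, 1e-9 },
};
```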
24. Intermediate Summary
• Emerging memory devices
– PCM, MRAM, ReRAM
– Optane DCPMM
• Main memory now becomes hybrid
– New system software studies are necessary
• RAMinate
– Hypervisor-based virtualization for hybrid main memory systems
• MESMERIC
– Software-based emulator using cache miss information
– Support asymmetric read/write latencies
• METICULOUS
– FPGA-based emulator
– Enable system software studies on emerging memory devices
25. More Information
• RAMinate
– https://github.com/takahiro-hirofuchi/raminate
• MESMERIC
– https://takahiro-hirofuchi.github.io/mesmeric-emulator/
– Binary executables under the MIT license
• METICULOUS
– https://takahiro-hirofuchi.github.io/meticulous-emulator/
– FPGA bitstream under the MIT license
26. Error Permissive Computing
What’s next?
A New Approach for Post Moore's Computer System Design
https://error-permissive-computing.github.io/
Contributors:
Ryousei Takano, Takahiro Hirofuchi, Mohamed Wahib, Truong Thao Nguyen, Akram Ben Ahmed,
Hiroki Kanezashi (RWBC, AIST)
Hiroko Arai and Hiroshi Imamura (Spintronics RC, AIST)
27. We are hiring!
• Tenure-track researchers
• https://www.aist.go.jp/aist_e/humanres/recruitment_information.html#ITHF
• Applications due by the middle of May
• Postdocs
• Technical assistants
• https://error-permissive-computing.github.io/
28. Error Permissive?
• “Permissive” in the Cambridge Dictionary:
“A person or society that is permissive allows behavior that other people might disapprove of.”
https://dictionary.cambridge.org/ja/dictionary/english/permissive
• According to colleagues born and raised in the US:
– “Permissive” sounds like we accept new ideas and new approaches; itʼs not conservative.
– “Permissive” implies that we do have a control knob.
29. The Definition of Error Permissive Computing
• We define that
– Error Permissive Computing is to controllably allow hardware errors and to develop system software that assures acceptable computational results.
• In other words,
– Develop a control knob that dynamically changes hardware reliability, and then improve performance as long as the results remain acceptable.
• We do not use the term “approximate computing”.
– “Approximate computing” has too broad a meaning.
30. Motivation (1)
• A computer requires too much reliability from memory devices
• A long way remains until a new memory device is in practical use
[Figure: the bit error ratio (BER) of a new memory device, from the discovery of its principle of operation through the development period to practical use (tens of years later), must be driven down to the 10^-9 to 10^-15 acceptable in memory systems]
Very long-term efforts are necessary to make it practical.
31. Motivation (2)
• What if we accept a high-BER device in the memory system?
[Figure: accepting a BER as high as 10^-5, instead of the conventional 10^-9 to 10^-15, makes it easy to employ new memory devices much earlier in the development period]
32. Energy-efficient, but unreliable memory device
• Voltage-driven magnetoresistive RAM (Vd-MRAM)
– Very fast read/write operations, applicable to CPU caches and main memory
– 2 orders of magnitude smaller power consumption
– The Spintronics Research Center of AIST leads its development.
• High write error ratio
– The current computer design requires a BER of 10^-9 to 10^-15.
– The best MRAM technology achieves only 10^-5.
33. Project Overview (Supported by JSPS Kakenhi A and other funds)
• Rethink the whole computer system from scratch to accept a high-BER memory device
• Open questions
– How do we redesign computer architecture and system software to accept unreliable memory devices?
– Is it really possible to achieve energy reduction with such memory devices?
• Task 1: Performance and error modeling of MRAM cells
• Task 2: Computer architecture accepting high BERs
• Task 3: System software accepting high BERs
• Task 4: Emulation techniques
[Figure: the four tasks mapped onto the system stack, from the memory cell (Task 1), through the CPU package with its cores, cache, and main memory (Task 2), up to the operating system, programming runtime, and application (Task 3), with emulation techniques as Task 4]
34. Task 1: Modeling of the write error ratio of MRAM
• The WER of each memory cell is different due to the variation of magnetic anisotropy.
35. Task 1. Performance and error modeling of MRAM cells
• Study bit-flip error mechanisms and develop a performance model (a generic illustration follows below)
• Joint work with spintronics researchers
[Figure: an example of the tradeoff between error rate and power consumption in STT-MRAM, based on the performance characteristics reported in [1]]
[1] Dynamics of spin torque switching in all-perpendicular spin valve nanopillars, JMMM 2014
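As a generic illustration of this kind of tradeoff (not the model developed in this task): under a write-verify-retry scheme, a cheaper, lower-energy write pulse raises the write error ratio, and the retries eat into the energy savings. The retry framing and the formula below are textbook assumptions, not taken from [1].

```c
/* Illustrative only: with write-verify-retry, a write that fails with
 * probability `wer` is retried until it sticks, so the expected number
 * of pulses follows a geometric distribution with mean 1/(1-wer).
 * Lowering the pulse energy typically raises `wer`, which is the
 * tradeoff the figure shows. */
static double expected_write_energy(double pulse_energy_j, double wer)
{
    return pulse_energy_j / (1.0 - wer);
}
```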
36. Task 2. Computer architecture accepting high BERs
• Develop hybrid memory systems combining reliable and unreliable memory devices
• Add an error mitigation module
– Something like ECC, but more lightweight (a classic example is sketched below)
[Figure: CPU cores with extended instruction sets, a hybrid memory controller with an error mitigation module, a cache combining SRAM with an unreliable cache, and a main memory combining DRAM with an unreliable memory]
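For a concrete (and classic) example of lightweight error mitigation, the C sketch below implements a Hamming(7,4) single-error-correcting code at nibble granularity. This is only an illustration of the idea; the project's actual error mitigation module is not specified on this slide.

```c
#include <stdint.h>

/* Encode a 4-bit nibble into a 7-bit Hamming(7,4) codeword.
 * Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4. */
static uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;
    uint8_t p2 = d1 ^ d3 ^ d4;
    uint8_t p3 = d2 ^ d3 ^ d4;
    return (p1 << 0) | (p2 << 1) | (d1 << 2) |
           (p3 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

/* Correct up to one flipped bit and return the decoded nibble.
 * The syndrome equals the 1-based position of the flipped bit. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s3 << 2));
    if (syndrome)
        cw ^= (uint8_t)(1u << (syndrome - 1));
    return (uint8_t)(((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
                     (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3));
}
```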
37. Task 3: System software accepting high BERs
• Develop system software running on the computer architecture of Task 2
– What kinds of memory objects can be placed on an unreliable memory device? (a placement sketch follows below)
– How do we track such memory objects in the programming runtime and operating system?
– How should data be represented?
• AI and media processing have bit-flip tolerance to a degree.
[Figure: the operating system and programming runtime analyze and track memory objects and place them on reliable or unreliable memory according to their bit-flip tolerance (low to high), on top of an error mitigation layer]
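A minimal sketch of tolerance-aware placement, assuming a runtime that exposes separate allocators for the reliable and unreliable pools; the enum, function names, and policy below are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical tolerance tag attached to each allocation request. */
enum bitflip_tolerance { TOLERANCE_LOW, TOLERANCE_HIGH };

/* Placeholders for pool-specific allocators the runtime would provide. */
void *alloc_reliable(size_t n)   { return malloc(n); }
void *alloc_unreliable(size_t n) { return malloc(n); }

/* Illustrative policy: pointers, code, and control data (low tolerance)
 * go to reliable memory; bulk data such as AI model weights or media
 * buffers (high tolerance) may go to unreliable, energy-efficient memory. */
void *em_alloc(size_t n, enum bitflip_tolerance tol)
{
    return (tol == TOLERANCE_HIGH) ? alloc_unreliable(n)
                                   : alloc_reliable(n);
}
```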
38. Task 4: Simulation and emulation techniques
• Simulate/emulate the behaviors of unreliable memory devices
• Integrate the performance model developed in Task 1 into gem5
• Develop software-based and hardware-based emulators (a bit-flip injection sketch follows below)
[Figure: a bit-flip emulator collects memory access data from the PMCs of the CPU cache and memory controller and injects bit flips into the unreliable portion of the memory used by the application]
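A minimal sketch of software bit-flip injection at a configurable BER; the function is invented for illustration, and a real emulator would sample flip positions (and drive the rate from observed memory accesses) rather than test every bit.

```c
#include <stdint.h>
#include <stdlib.h>

/* Inject random single-bit flips into a buffer at a given bit error
 * ratio (BER): every bit is flipped independently with probability
 * `ber`. Simple but O(8*len); fine for a sketch. */
static void inject_bitflips(uint8_t *buf, size_t len, double ber)
{
    for (size_t i = 0; i < len; i++)
        for (int b = 0; b < 8; b++)
            if ((double)rand() / RAND_MAX < ber)
                buf[i] ^= (uint8_t)(1u << b);
}
```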
39. We are hiring!
• Tenure-track researchers
• https://www.aist.go.jp/aist_e/humanres/recruitment_information.html#ITHF
• Postdocs
• Technical assistants
• https://error-permissive-computing.github.io/
40. Summary
• Virtualization for emerging main memory devices
– RAMinate: Hypervisor-based virtualization
– MESMERIC: Software-based emulator supporting asymmetric read/write latencies
– METICULOUS: FPGA-based main memory emulator
• Error permissive computing
– A controllable variant of approximate computing
– Applied throughout the whole memory hierarchy
– Now ongoing