SlideShare a Scribd company logo
1 of 40
Download to read offline
Virtualization
for Emerging Memory Devices
1
Takahiro Hirofuchi, Ph.D.
Senior Research Scientist, Deputy Director of RWBC-OIL,
National Institute of Advanced Industrial Science and Technology (AIST)
Invited Presentation, Cool Chips 23, Apr 16, 2020
The updated version of slides will be uploaded to my homepage.
https://takahiro-hirofuchi.github.io/
DRAM is not scalable anymore …
• Energy consuming
– Need refresh energy
– More than 30% of energy
consumption of a data center
• Not scalable anymore
– Serious energy dissipation
• Memory capacity wall
• Memory bandwidth wall
2
capacitor
DRAM
High
resistance
Emerging memory devices are promising.
3
Phase Change Memory
(PCM)
Pinned layer
Free layer
Low
resistance
High
resistance
CrystalAmorphous
Low
resistance
High
resistance
Magneto-resistive Random
Access Memory (MRAM)
• Energy efficient
– No refresh energy
– Non-volatile
• Scalable in theory
– Higher density will be possible
Resistive Random Access
Memory (ReRAM)
Filaments
forming
Low
resistance
• Potential to drastically extend
the capacity of main memory
• Potential for 3D-stack
integration drastically increasing
bandwidths
• Released in April 2019
• First byte-addressable NVM based on
a next-generation memory device
– 3D-stacked resistive memory cells
– (Possibly) PCM-based
• Connected to the memory bus of CPU
via the DIMM interface
• 128-512 GB per memory module
– A DRAM module is typically 8-32 GB.
Intel Optane Data Center Persistent Memory (DCPMM)
4
Optane DCPMMDRAM
Memory
Controller
CPU
Optane DCPMMʼs Latency is higher than DRAM.
5
For read-only access, DCPMMʼs latency is 4 times larger.
For read/write access, DCPMMʼs latency is 4.1 times larger.
[1] A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module, Takahiro Hirofuchi, Ryousei
Takano, IEICE Transactions on Information and Systems, Vol.E103-D, No.5, IEICE, May 2020, Paper (arXiv)
Optane DCPMMʼs Bandwidth is limited.
6
For read-only access, DCPMMʼs bandwidth is 37%.
For read/write access, DCPMMʼs bandwidth is 8%.
Huge Performance Gap in Main Memory
• Optane DCPMM
– Capacity is 10 times larger.
– Latency is 4 times higher.
– Bandwidth is 10%.
• PCM and ReRAM
– Slow write operations
– High write energy
• STT-MRAM (around 2016)
– High write energy
7
Emerging Memory DeviceDRAM
DRAM Emerging Memory Device
Physical Address
Offset 0 Offset M Offset N
Advantage in capacity, but huge difference
in some performance metrics
• Main memory now becomes hybrid.
• New system software studies are necessary.
• Virtually create a unified memory combined with DRAM
and byte-addressable NVM (e.g., Optane DCPMM)
• Make it appear like one memory device to an OS and
applications
• Everything is done at the hypervisor layer
• Support any OSes and applications without any
modification to them
• Maximize the performance of a VM even with a small mixed
ratio of DRAM
• Hot memory pages are automatically relocated to
DRAM (i.e., fast memory)
• Cold memory pages are automatically relocated to
NVM (i.e., slow memory)
• https://github.com/takahiro-hirofuchi/raminate
• ACM SoCC 2016 Best Paper Award!
Byte-addressable NVM
(Optane, MRAM, PCM)
DRAM
The main memory of a VM
(It appears like a DRAM)
RAMinate
(Extended Qemu/KVM)
1. Create a unified RAM
2. Monitor memory access
3. Optimize page mapping
RAMinate: Hypervisor-based Virtualization for Hybrid Main Memory
Overview of RAMinate (1)
9
• Allocate the RAM of a VM from the DRAM and NVM pools
• Guest OS sees a uniform RAM composed of one memory
device
NVM PoolDRAM Pool
1GB 15GB
DRAM 16GB
Virtual Machine (VM)
RAMinate (Extended Qemu/KVM)
Overview of RAMinate (2)
10
NVM PoolDRAM Pool
1a. Detect read/write-intensive guest physical pages on NVM
Overview of RAMinate (3)
11
NVM PoolDRAM Pool
1b. Detect guest physical pages on DRAM
that have few read/write operations
Overview of RAMinate (4)
2. Swap read/write-intensive NVM pages
with DRAM pages having few read/write requests
12
NVM PoolDRAM Pool
Divert read/write traffic to
DRAM (faster memory) behind
the guest OS
Read
Thpt
(MB/s)
Write
Thpt
(MB/s)
Elapsed Time (s) The Number of Page Relocations
1. Just after kernel build started, most memory traffic
was from/to the DCPMM side. But, after RAMinate
optimized page locations, the memory traffic of DCPMM
was reduced to 50%.
2. RAMinate detected hot memory pages and
moved them to the DRAM side. It also moved
cold memory pages to the Optane side.
2b. It continuously updated
memory mappings in response
to the change of hot memory
pages.
2a. Initially, it optimized
page locations intensively.
A Use-case of RAMinate for Optane DCPMM
A VM is set up with 4 GB RAM of the mixed
ratio of 1% DRAM and 99% DCPMM.
The time required for
kernel compile was
reduced by 10% in
comparison to no
optimization case.
How is RAMinate implemented?
• Everything is done in the software layer
– Only requirements
• A memory device is mapped to physical address space
• Access bit / dirty bit of a page table entry for VMs (e.g., Extended Page Table
in Intel CPU)
– Any (rich) optimization algorithm is possible
• No need for a special memory controller
– In comparison to Memory Mode of Optane DCPMM
• Memory Mode: DRAM as L4 cache, transparent to OS
• Can customize the mixed ratio of DRAM and NVM per VM
14
Design Details (Component 1)
RAMinate periodically scans Extended Page Table entries
of a VM and finds dirty/accessed guest pages.
9
EPT maintains mapping between guest frame numbers and physical
frame numbers. When a guest page is written, CPU sets the dirty bit.
16
Q0
Q1
Q2
Q3
Q0
Q1
Q2
Q3
Promote a page when the number of
past detected accesses of a page
exceeds a threshold value.
Demote a page when the elapsed time
from its last access exceeds a threshold
value.
Design Details (Component 2)
The optimization algorithm, Corked Multi Queue, determines which
guest pages must be migrated, considering memory access history.
The queue level relates to the number of past page accesses. NVM
pages in higher levels of queues are candidates for page migration.
• Compare CMQ, LRU and nothing
• Use trace data during SPEC CPU 2006
17
DRAM hit ratio The number of swapped pages
CMQ drastically reduces the
number of necessary page swaps.
Simulation on optimization algorithms
18
Design Details (Component 3)
RAMinate dynamically updates page
mapping between guest and physical
pages without disrupting the guest
operating system.
• Temporary stop the VM during optimization
1. Stop the VM
2. Conduct page swap operations (repeated for 1000 pairs at most)
1. Swap the PFNs of 2 EPT entries
2. Swap the data of the memory pages
3. Restart the VM
19
Page Swap in Detail
• Downtime is approximately 30ms for 1000
swap operations.
• Short enough, acceptable
– Page mapping optimization is performed every 5
second in the default setting.
20
Page Swap Overhead
21
• Assumption at the moment of 2016:
STT-MRAM is fast like DRAM. However, its write energy is 10^2 time larger than that of DRAM.
• Hybrid memory with 10% DRAM outperformed DRAM-only memory in terms of energy consumption.
A Use-case of RAMinate for Future STT-MRAM
Energy
Consumption of
Main Memory
ACM SoCC 2016, Hirofuchi et. al.
Task Schedular
(Dynamically adjust CPU time
assignment)
MESMERIC: A Software-based Emulator for Non-volatile Main Memory
CPU Cores
Performance Model
(Analyze cache miss in real time)
LLC PMC
Memory Controllers PMC
PMC
• Virtually create a slow NVM for a target program, by adjusting the
execution speed of the program running on a DRAM-equipped
machine
• Fast and accurate
• Aware of asymmetric read/write latencies (i.e., advantage over HPʼs
Quartz)
DRAM
(Mimicked like NVM)
Target Program
22
A Software-based NVM Emulator
Supporting Read/Write Asymmetric
Latencies, IEICE Trans., Dec, 2019
METICULOUS: An FPGA-based Emulator for Hybrid Main Memory
• Emulate hybrid main memory
systems in the hardware level
– Set latency, bandwidth and bit-flips in
each physical address range
• Enable detailed performance
evaluation of new system software
mechanisms
23
Reproducible behaviors
Coarse-grained Fine-grained
Speed
Fast
(Real time)
Slow
Cycle-accurate CPU
simulators(ex. gem5)
Software-based emulator
(MESMERIC)
FPGA-based
emulator
(METICULOUS)
DRAM
Latency Insertion
Bandwidth Limit.
FPGACPU
Latency Insertion
Bandwidth Limit.
LLC
CPU
Cores
Intermediate Summary
• Emerging memory devices
– PCM, MRAM, ReRAM
– Optane DCPMM
• Main memory now becomes hybrid
– New system software studies are necessary
• RAMinate
– Hypervisor-based virtualization for hybrid main memory systems
• MESMERIC:
– Software-based emulator using cache miss information
– Support asymmetric read/write latencies
• METICULOUS
– FPGA-based emulator
– Enable system software studies on emerging memory devices
24
More Information
• RAMinate
– https://github.com/takahiro-hirofuchi/raminate
• MESMERIC
– https://takahiro-hirofuchi.github.io/mesmeric-emulator/
– Binary executables under the MIT license
• METICULOUS
– https://takahiro-hirofuchi.github.io/meticulous-emulator/
– FPGA bitstream under the MIT license
25
Error Permissive Computing
26
What’s next?
A New Approach for Post Moore's Computer System Design
https://error-permissive-computing.github.io/
Contributors:
Ryousei Takano, Takahiro Hirofuchi, Mohamed Wahib, Truong Thao Nguyen, Akram Ben Ahmed,
Hiroki Kanezashi (RWBC, AIST)
Hiroko Arai and Hiroshi Imamura (Spintronics RC, AIST)
We are hiring!
27
• Tenured-track Researchers
• https://www.aist.go.jp/aist_e/humanres/recr
uitment_information.html#ITHF
• By the middle of May
• Postdocs
• Technical assistants
• https://error-permissive-computing.github.io/
• Permissive in Cambridge Dictionary
“A person or society that is permissive allows behavior that
other people might disapprove of:”
https://dictionary.cambridge.org/ja/dictionary/english/permissive
• According to colleagues born and raised in US
– Permissive sounds like we accept new ideas and new approaches;
itʼs not conservative.
– Permissive implies that we do have a control knob.
28
Error Permissive?
The Definition of Error Permissive Computing
• We define that,
– Error Permissive Computing is to controllably allow hardware
errors and develop system software to assure acceptable
computational results.
• In other words,
– Develop a control knob to dynamically change hardware reliability,
and then improve performance as long as obtaining acceptable
results
• We do not use “approximate computing”.
– Approximate Computing has too broad meanings.
29
Motivation (1)
• A computer requires too much reliability for memory devices
• A long way ahead until a new memory device is in practical use
30
Bit error ratio (BER) of a
new memory device
Development periodDiscovery of the
principle of operation
10-9
10-15
Acceptable BERs in
memory systems
Practical use
(tens of years later)
Very long-term efforts are necessary to make it practical.
10-9
10-15
Easy to employ new memory devices
10-5
Accept high BER
Motivation (2)
31
Bit error ratio (BER) of a
new memory device
Acceptable BERs in
memory systems
Development periodPractical useDiscovery
• What if we accept a high-BER device in the memory system?
Energy-efficient, but unreliable memory device
• Voltage-driven magnetoresistive RAM (Vd-MRAM)
– Very fast read/write operations applicable to CPU cache and main
memory
– 2 orders of magnitudes smaller power consumption
– The spintronics research center of AIST leads development.
• High write error ratio
– The current computer design requires 10-9〜 10-15 of BER.
– The best technology of the MRAM achieves only 10-5.
32
Project Overview (Supported by JSPS Kakenhi A and other funds)
• Rethink whole the computer system from scratch to
accept a high BER memory device
• Open questions
– How do we redesign computer architecture and system
software to accept unreliable memory devices?
– Is it really possible to achieve energy reduction with such
memory devices?
• Task 1: Performance and error modeling of MRAM cells
• Task 2: Computer architecture accepting high BERs
• Task 3: System software accepting high BERs
• Task 4: Emulation techniques
33
CPU Cache Main Memory
CPU Package
Operating System
Programming Runtime
Application
CPU Core
Memory Cell
Task 1
Task 2
Task 3
Task 4
Task 1: Modeling of the write error ratio of MRAM
• The WER of each memory cell is different due to the
variation of magnetic anisotropy.
34
Task 1. Performance and error modeling of MRAM cells
• Study bit-flip error
mechanisms and develop a
performance model
• Joint work with spintronics
researchers
35
An example of tradeoff between error rate and power
consumption in STT-MRAM, which is based on
performance characteristics according to [1].
[1] Dynamics of spin torque switching in all-
perpendicular spin valve nanopillars , JMMM 2014
Task 2. Computer architecture accepting high BERs
• Develop hybrid memory systems
combining reliable and unreliable
memory devices
• Add an error mitigation module
– Something like ECC, but more
lightweight
36
SRAM Unreliable Cache
DRAM Unreliable Memory
CPU Cache
Main Memory
Hybrid Memory Controller
CPU Core
(Extended Instruction Sets)
Error
Mitigation
Task 3: System software accepting high BERs
37
• Develop system software running on
the computer architecture of Task 2
– What kinds of memory objects can be
laced on an unreliable memory device?
– How do we track such memory objects
in programming runtime and operating
system?
– How should data be presented?
• AI and media processing have bit-flip
tolerance to a degree.
Reliable Unreliable Memory
Operating System
Object Analysis and Tracking
Low ß---- Bit-flip tolerance ---à High
Programming Runtime
…
Error
Mitigation
Task 4: Simulation and emulation techniques
38
• Simulate/Emulate the behaviors of
unreliable memory devices
• Integrate the developed performance
model of Task 1 into gem5
• Develop software-based and
hardware-based emulators
CPU
Bit-flip Emulator
CPU Cache PMC
Memory Controller PMC
PMC
Application
Reliable Unreliable Memory
Bit-flip injections
Memory
access data
We are hiring!
39
• Tenured-track Researchers
• https://www.aist.go.jp/aist_e/humanres/recr
uitment_information.html#ITHF
• Postdocs
• Technical assistants
• https://error-permissive-computing.github.io/
Summary
• Virtualization for emerging main memory devices
– RAMinate: Hypervisor-based virtualization
– MESMERIC: Software-based emulator supporting asymmetric
read/write latencies
– METICULOUS: FPGA-based main memory emulator
• Error permissive computing
– Our definition of an approximate computing
– Throughout whole the memory hierarchy
– Now ongoing
40

More Related Content

What's hot

Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingAjil Jose
 
6. Live VM migration
6. Live VM migration6. Live VM migration
6. Live VM migrationHwanju Kim
 
31 address binding, dynamic loading
31 address binding, dynamic loading31 address binding, dynamic loading
31 address binding, dynamic loadingmyrajendra
 
Warehouse scale computer
Warehouse scale computerWarehouse scale computer
Warehouse scale computerHassan A-j
 
Lecture 6
Lecture  6Lecture  6
Lecture 6Mr SMAK
 
Lecture 6
Lecture  6Lecture  6
Lecture 6Mr SMAK
 
Computer architecture
Computer architecture Computer architecture
Computer architecture Ashish Kumar
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computingMehul Patel
 
4. Memory virtualization and management
4. Memory virtualization and management4. Memory virtualization and management
4. Memory virtualization and managementHwanju Kim
 
Unit I Memory technology and optimization
Unit I Memory technology and optimizationUnit I Memory technology and optimization
Unit I Memory technology and optimizationK Gowsic Gowsic
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureHaris456
 
High Endurance Last Level Hybrid Cache Design
High Endurance Last Level Hybrid Cache DesignHigh Endurance Last Level Hybrid Cache Design
High Endurance Last Level Hybrid Cache Designkeerthi thallam
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer ArchitectureSubhasis Dash
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephenSteve Feldman
 
Parallel architecture
Parallel architectureParallel architecture
Parallel architectureMr SMAK
 
Windows memory manager internals
Windows memory manager internalsWindows memory manager internals
Windows memory manager internalsSisimon Soman
 
Lecture 6.1
Lecture  6.1Lecture  6.1
Lecture 6.1Mr SMAK
 
Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1
Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1
Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1IBM India Smarter Computing
 

What's hot (20)

Indian Contribution towards Parallel Processing
Indian Contribution towards Parallel ProcessingIndian Contribution towards Parallel Processing
Indian Contribution towards Parallel Processing
 
6. Live VM migration
6. Live VM migration6. Live VM migration
6. Live VM migration
 
31 address binding, dynamic loading
31 address binding, dynamic loading31 address binding, dynamic loading
31 address binding, dynamic loading
 
Warehouse scale computer
Warehouse scale computerWarehouse scale computer
Warehouse scale computer
 
Lecture 6
Lecture  6Lecture  6
Lecture 6
 
03 Hadoop
03 Hadoop03 Hadoop
03 Hadoop
 
Lecture 6
Lecture  6Lecture  6
Lecture 6
 
Computer architecture
Computer architecture Computer architecture
Computer architecture
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computing
 
4. Memory virtualization and management
4. Memory virtualization and management4. Memory virtualization and management
4. Memory virtualization and management
 
Unit I Memory technology and optimization
Unit I Memory technology and optimizationUnit I Memory technology and optimization
Unit I Memory technology and optimization
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer Architecture
 
High Endurance Last Level Hybrid Cache Design
High Endurance Last Level Hybrid Cache DesignHigh Endurance Last Level Hybrid Cache Design
High Endurance Last Level Hybrid Cache Design
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
 
XS Boston 2008 Memory Overcommit
XS Boston 2008 Memory OvercommitXS Boston 2008 Memory Overcommit
XS Boston 2008 Memory Overcommit
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen
 
Parallel architecture
Parallel architectureParallel architecture
Parallel architecture
 
Windows memory manager internals
Windows memory manager internalsWindows memory manager internals
Windows memory manager internals
 
Lecture 6.1
Lecture  6.1Lecture  6.1
Lecture 6.1
 
Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1
Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1
Linux on System z Optimizing Resource Utilization for Linux under z/VM - Part1
 

Similar to Virtualization for Emerging Memory Devices

network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computingNiranjana Ambadi
 
SanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and CassandraSanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and CassandraDataStax Academy
 
GCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory AllocatorGCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory AllocatorSeongJae Park
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms finalYutaka Kawai
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCoburn Watson
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldPraveen Kumar
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentationKarthik Iyr
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesKarthik Iyr
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Amazon Web Services
 
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDKlecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDKofficeaiotfab
 
Introduction of ram ddr3
Introduction of ram ddr3Introduction of ram ddr3
Introduction of ram ddr3Technocratz
 
Introduction of ram ddr3
Introduction of ram ddr3Introduction of ram ddr3
Introduction of ram ddr3Jatin Goyal
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET Journal
 
Flexible and Scalable Domain-Specific Architectures
Flexible and Scalable Domain-Specific ArchitecturesFlexible and Scalable Domain-Specific Architectures
Flexible and Scalable Domain-Specific ArchitecturesNetronome
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsAnand Haridass
 
Memory hierarchy.pdf
Memory hierarchy.pdfMemory hierarchy.pdf
Memory hierarchy.pdfISHAN194169
 
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...The Linux Foundation
 

Similar to Virtualization for Emerging Memory Devices (20)

RAMinate Invited Talk at NII
RAMinate Invited Talk at NIIRAMinate Invited Talk at NII
RAMinate Invited Talk at NII
 
RAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 TalkRAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 Talk
 
network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computing
 
SanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and CassandraSanDisk: Persistent Memory and Cassandra
SanDisk: Persistent Memory and Cassandra
 
GCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory AllocatorGCMA: Guaranteed Contiguous Memory Allocator
GCMA: Guaranteed Contiguous Memory Allocator
 
Sc19 ibm hms final
Sc19 ibm hms finalSc19 ibm hms final
Sc19 ibm hms final
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworld
 
Literature survey presentation
Literature survey presentationLiterature survey presentation
Literature survey presentation
 
A survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphonesA survey on exploring memory optimizations in smartphones
A survey on exploring memory optimizations in smartphones
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDKlecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
lecture asdkvakm;bk;dv;advvAVHD;KASV;DVKHSVDK
 
Introduction of ram ddr3
Introduction of ram ddr3Introduction of ram ddr3
Introduction of ram ddr3
 
Introduction of ram ddr3
Introduction of ram ddr3Introduction of ram ddr3
Introduction of ram ddr3
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
 
Flexible and Scalable Domain-Specific Architectures
Flexible and Scalable Domain-Specific ArchitecturesFlexible and Scalable Domain-Specific Architectures
Flexible and Scalable Domain-Specific Architectures
 
Heterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of SystemsHeterogeneous Computing : The Future of Systems
Heterogeneous Computing : The Future of Systems
 
Memory hierarchy.pdf
Memory hierarchy.pdfMemory hierarchy.pdf
Memory hierarchy.pdf
 
Processing-in-Memory
Processing-in-MemoryProcessing-in-Memory
Processing-in-Memory
 
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
XPDS13: Performance Evaluation of Live Migration based on Xen ARM PVH - Jaeyo...
 

Recently uploaded

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 

Recently uploaded (20)

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 

Virtualization for Emerging Memory Devices

  • 1. Virtualization for Emerging Memory Devices 1 Takahiro Hirofuchi, Ph.D. Senior Research Scientist, Deputy Director of RWBC-OIL, National Institute of Advanced Industrial Science and Technology (AIST) Invited Presentation, Cool Chips 23, Apr 16, 2020 The updated version of slides will be uploaded to my homepage. https://takahiro-hirofuchi.github.io/
  • 2. DRAM is not scalable anymore … • Energy consuming – Need refresh energy – More than 30% of energy consumption of a data center • Not scalable anymore – Serious energy dissipation • Memory capacity wall • Memory bandwidth wall 2 capacitor DRAM
  • 3. High resistance Emerging memory devices are promising. 3 Phase Change Memory (PCM) Pinned layer Free layer Low resistance High resistance CrystalAmorphous Low resistance High resistance Magneto-resistive Random Access Memory (MRAM) • Energy efficient – No refresh energy – Non-volatile • Scalable in theory – Higher density will be possible Resistive Random Access Memory (ReRAM) Filaments forming Low resistance • Potential to drastically extend the capacity of main memory • Potential for 3D-stack integration drastically increasing bandwidths
  • 4. • Released in April 2019 • First byte-addressable NVM based on a next-generation memory device – 3D-stacked resistive memory cells – (Possibly) PCM-based • Connected to the memory bus of CPU via the DIMM interface • 128-512 GB per memory module – A DRAM module is typically 8-32 GB. Intel Optane Data Center Persistent Memory (DCPMM) 4 Optane DCPMMDRAM Memory Controller CPU
  • 5. Optane DCPMMʼs Latency is higher than DRAM. 5 For read-only access, DCPMMʼs latency is 4 times larger. For read/write access, DCPMMʼs latency is 4.1 times larger. [1] A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module, Takahiro Hirofuchi, Ryousei Takano, IEICE Transactions on Information and Systems, Vol.E103-D, No.5, IEICE, May 2020, Paper (arXiv)
  • 6. Optane DCPMMʼs Bandwidth is limited. 6 For read-only access, DCPMMʼs bandwidth is 37%. For read/write access, DCPMMʼs bandwidth is 8%.
  • 7. Huge Performance Gap in Main Memory • Optane DCPMM – Capacity is 10 times larger. – Latency is 4 times higher. – Bandwidth is 10%. • PCM and ReRAM – Slow write operations – High write energy • STT-MRAM (around 2016) – High write energy 7 Emerging Memory DeviceDRAM DRAM Emerging Memory Device Physical Address Offset 0 Offset M Offset N Advantage in capacity, but huge difference in some performance metrics • Main memory now becomes hybrid. • New system software studies are necessary.
  • 8. • Virtually create a unified memory combined with DRAM and byte-addressable NVM (e.g., Optane DCPMM) • Make it appear like one memory device to an OS and applications • Everything is done at the hypervisor layer • Support any OSes and applications without any modification to them • Maximize the performance of a VM even with a small mixed ratio of DRAM • Hot memory pages are automatically relocated to DRAM (i.e., fast memory) • Cold memory pages are automatically relocated to NVM (i.e., slow memory) • https://github.com/takahiro-hirofuchi/raminate • ACM SoCC 2016 Best Paper Award! Byte-addressable NVM (Optane, MRAM, PCM) DRAM The main memory of a VM (It appears like a DRAM) RAMinate (Extended Qemu/KVM) 1. Create a unified RAM 2. Monitor memory access 3. Optimize page mapping RAMinate: Hypervisor-based Virtualization for Hybrid Main Memory
  • 9. Overview of RAMinate (1) 9 • Allocate the RAM of a VM from the DRAM and NVM pools • Guest OS sees a uniform RAM composed of one memory device NVM PoolDRAM Pool 1GB 15GB DRAM 16GB Virtual Machine (VM) RAMinate (Extended Qemu/KVM)
  • 10. Overview of RAMinate (2) 10 NVM PoolDRAM Pool 1a. Detect read/write-intensive guest physical pages on NVM
  • 11. Overview of RAMinate (3) 11 NVM PoolDRAM Pool 1b. Detect guest physical pages on DRAM that have few read/write operations
  • 12. Overview of RAMinate (4) 2. Swap read/write-intensive NVM pages with DRAM pages having few read/write requests 12 NVM PoolDRAM Pool Divert read/write traffic to DRAM (faster memory) behind the guest OS
  • 13. Read Thpt (MB/s) Write Thpt (MB/s) Elapsed Time (s) The Number of Page Relocations 1. Just after kernel build started, most memory traffic was from/to the DCPMM side. But, after RAMinate optimized page locations, the memory traffic of DCPMM was reduced to 50%. 2. RAMinate detected hot memory pages and moved them to the DRAM side. It also moved cold memory pages to the Optane side. 2b. It continuously updated memory mappings in response to the change of hot memory pages. 2a. Initially, it optimized page locations intensively. A Use-case of RAMinate for Optane DCPMM A VM is set up with 4 GB RAM of the mixed ratio of 1% DRAM and 99% DCPMM. The time required for kernel compile was reduced by 10% in comparison to no optimization case.
  • 14. How is RAMinate implemented? • Everything is done in the software layer – Only requirements • A memory device is mapped to physical address space • Access bit / dirty bit of a page table entry for VMs (e.g., Extended Page Table in Intel CPU) – Any (rich) optimization algorithm is possible • No need for a special memory controller – In comparison to Memory Mode of Optane DCPMM • Memory Mode: DRAM as L4 cache, transparent to OS • Can customize the mixed ratio of DRAM and NVM per VM 14
  • 15. Design Details (Component 1) RAMinate periodically scans Extended Page Table entries of a VM and finds dirty/accessed guest pages. 9 EPT maintains mapping between guest frame numbers and physical frame numbers. When a guest page is written, CPU sets the dirty bit.
  • 16. 16 Q0 Q1 Q2 Q3 Q0 Q1 Q2 Q3 Promote a page when the number of past detected accesses of a page exceeds a threshold value. Demote a page when the elapsed time from its last access exceeds a threshold value. Design Details (Component 2) The optimization algorithm, Corked Multi Queue, determines which guest pages must be migrated, considering memory access history. The queue level relates to the number of past page accesses. NVM pages in higher levels of queues are candidates for page migration.
  • 17. • Compare CMQ, LRU and nothing • Use trace data during SPEC CPU 2006 17 DRAM hit ratio The number of swapped pages CMQ drastically reduces the number of necessary page swaps. Simulation on optimization algorithms
  • 18. 18 Design Details (Component 3) RAMinate dynamically updates page mapping between guest and physical pages without disrupting the guest operating system.
  • 19. • Temporary stop the VM during optimization 1. Stop the VM 2. Conduct page swap operations (repeated for 1000 pairs at most) 1. Swap the PFNs of 2 EPT entries 2. Swap the data of the memory pages 3. Restart the VM 19 Page Swap in Detail
  • 20. • Downtime is approximately 30ms for 1000 swap operations. • Short enough, acceptable – Page mapping optimization is performed every 5 second in the default setting. 20 Page Swap Overhead
  • 21. 21 • Assumption at the moment of 2016: STT-MRAM is fast like DRAM. However, its write energy is 10^2 time larger than that of DRAM. • Hybrid memory with 10% DRAM outperformed DRAM-only memory in terms of energy consumption. A Use-case of RAMinate for Future STT-MRAM Energy Consumption of Main Memory ACM SoCC 2016, Hirofuchi et. al.
  • 22. Task Schedular (Dynamically adjust CPU time assignment) MESMERIC: A Software-based Emulator for Non-volatile Main Memory CPU Cores Performance Model (Analyze cache miss in real time) LLC PMC Memory Controllers PMC PMC • Virtually create a slow NVM for a target program, by adjusting the execution speed of the program running on a DRAM-equipped machine • Fast and accurate • Aware of asymmetric read/write latencies (i.e., advantage over HPʼs Quartz) DRAM (Mimicked like NVM) Target Program 22 A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies, IEICE Trans., Dec, 2019
  • 23. METICULOUS: An FPGA-based Emulator for Hybrid Main Memory • Emulate hybrid main memory systems in the hardware level – Set latency, bandwidth and bit-flips in each physical address range • Enable detailed performance evaluation of new system software mechanisms 23 Reproducible behaviors Coarse-grained Fine-grained Speed Fast (Real time) Slow Cycle-accurate CPU simulators(ex. gem5) Software-based emulator (MESMERIC) FPGA-based emulator (METICULOUS) DRAM Latency Insertion Bandwidth Limit. FPGACPU Latency Insertion Bandwidth Limit. LLC CPU Cores
  • 24. Intermediate Summary • Emerging memory devices – PCM, MRAM, ReRAM – Optane DCPMM • Main memory now becomes hybrid – New system software studies are necessary • RAMinate – Hypervisor-based virtualization for hybrid main memory systems • MESMERIC: – Software-based emulator using cache miss information – Support asymmetric read/write latencies • METICULOUS – FPGA-based emulator – Enable system software studies on emerging memory devices 24
  • 25. More Information • RAMinate – https://github.com/takahiro-hirofuchi/raminate • MESMERIC – https://takahiro-hirofuchi.github.io/mesmeric-emulator/ – Binary executables under the MIT license • METICULOUS – https://takahiro-hirofuchi.github.io/meticulous-emulator/ – FPGA bitstream under the MIT license 25
  • 26. Error Permissive Computing 26 What’s next? A New Approach for Post Moore's Computer System Design https://error-permissive-computing.github.io/ Contributors: Ryousei Takano, Takahiro Hirofuchi, Mohamed Wahib, Truong Thao Nguyen, Akram Ben Ahmed, Hiroki Kanezashi (RWBC, AIST) Hiroko Arai and Hiroshi Imamura (Spintronics RC, AIST)
  • 27. We are hiring! 27 • Tenured-track Researchers • https://www.aist.go.jp/aist_e/humanres/recr uitment_information.html#ITHF • By the middle of May • Postdocs • Technical assistants • https://error-permissive-computing.github.io/
  • 28. • Permissive in Cambridge Dictionary “A person or society that is permissive allows behavior that other people might disapprove of:” https://dictionary.cambridge.org/ja/dictionary/english/permissive • According to colleagues born and raised in US – Permissive sounds like we accept new ideas and new approaches; itʼs not conservative. – Permissive implies that we do have a control knob. 28 Error Permissive?
  • 29. The Definition of Error Permissive Computing • We define that, – Error Permissive Computing is to controllably allow hardware errors and develop system software to assure acceptable computational results. • In other words, – Develop a control knob to dynamically change hardware reliability, and then improve performance as long as obtaining acceptable results • We do not use “approximate computing”. – Approximate Computing has too broad meanings. 29
  • 30. Motivation (1) • A computer requires too much reliability for memory devices • A long way ahead until a new memory device is in practical use 30 Bit error ratio (BER) of a new memory device Development periodDiscovery of the principle of operation 10-9 10-15 Acceptable BERs in memory systems Practical use (tens of years later) Very long-term efforts are necessary to make it practical.
  • 31. 10-9 10-15 Easy to employ new memory devices 10-5 Accept high BER Motivation (2) 31 Bit error ratio (BER) of a new memory device Acceptable BERs in memory systems Development periodPractical useDiscovery • What if we accept a high-BER device in the memory system?
  • 32. Energy-efficient, but unreliable memory device • Voltage-driven magnetoresistive RAM (Vd-MRAM) – Very fast read/write operations applicable to CPU cache and main memory – 2 orders of magnitudes smaller power consumption – The spintronics research center of AIST leads development. • High write error ratio – The current computer design requires 10-9〜 10-15 of BER. – The best technology of the MRAM achieves only 10-5. 32
  • 33. Project Overview (Supported by JSPS Kakenhi A and other funds) • Rethink whole the computer system from scratch to accept a high BER memory device • Open questions – How do we redesign computer architecture and system software to accept unreliable memory devices? – Is it really possible to achieve energy reduction with such memory devices? • Task 1: Performance and error modeling of MRAM cells • Task 2: Computer architecture accepting high BERs • Task 3: System software accepting high BERs • Task 4: Emulation techniques 33 CPU Cache Main Memory CPU Package Operating System Programming Runtime Application CPU Core Memory Cell Task 1 Task 2 Task 3 Task 4
  • 34. Task 1: Modeling of the write error ratio of MRAM • The WER of each memory cell is different due to the variation of magnetic anisotropy. 34
  • 35. Task 1. Performance and error modeling of MRAM cells • Study bit-flip error mechanisms and develop a performance model • Joint work with spintronics researchers 35 An example of tradeoff between error rate and power consumption in STT-MRAM, which is based on performance characteristics according to [1]. [1] Dynamics of spin torque switching in all- perpendicular spin valve nanopillars , JMMM 2014
  • 36. Task 2. Computer architecture accepting high BERs • Develop hybrid memory systems combining reliable and unreliable memory devices • Add an error mitigation module – Something like ECC, but more lightweight 36 SRAM Unreliable Cache DRAM Unreliable Memory CPU Cache Main Memory Hybrid Memory Controller CPU Core (Extended Instruction Sets) Error Mitigation
  • 37. Task 3: System software accepting high BERs 37 • Develop system software running on the computer architecture of Task 2 – What kinds of memory objects can be laced on an unreliable memory device? – How do we track such memory objects in programming runtime and operating system? – How should data be presented? • AI and media processing have bit-flip tolerance to a degree. Reliable Unreliable Memory Operating System Object Analysis and Tracking Low ß---- Bit-flip tolerance ---à High Programming Runtime … Error Mitigation
  • 38. Task 4: Simulation and emulation techniques 38 • Simulate/Emulate the behaviors of unreliable memory devices • Integrate the developed performance model of Task 1 into gem5 • Develop software-based and hardware-based emulators CPU Bit-flip Emulator CPU Cache PMC Memory Controller PMC PMC Application Reliable Unreliable Memory Bit-flip injections Memory access data
  • 39. We are hiring! 39 • Tenured-track Researchers • https://www.aist.go.jp/aist_e/humanres/recr uitment_information.html#ITHF • Postdocs • Technical assistants • https://error-permissive-computing.github.io/
  • 40. Summary • Virtualization for emerging main memory devices – RAMinate: Hypervisor-based virtualization – MESMERIC: Software-based emulator supporting asymmetric read/write latencies – METICULOUS: FPGA-based main memory emulator • Error permissive computing – Our definition of an approximate computing – Throughout whole the memory hierarchy – Now ongoing 40