RAM is "cheap". Or is it?
If a million-machine data center could cut RAM in half, how much could be saved in capital equipment cost and power/cooling expense?
Transcendent Memory (or "tmem") is a new approach for flexibly, dynamically, and efficiently managing physical memory. First conceived to facilitate the optimization of physical memory utilization among a set of guests in a virtualization environment (and implemented in Xen 4.0), tmem has now also been applied in the kernel to dynamically compress page cache and swap pages ("zcache"), and to dynamically hot-plug memory among a set of kernels ("RAMster").
And tmem may, in the future, allow more effective utilization of emerging memory-extension technologies. All this requires only minimal changes to the kernel.
Our agenda for today: I'm going to quickly review the motivation, identify the key problem, and lay out the challenge of optimizing memory utilization in both virtualized and non-virtualized environments. Transcendent memory (or you may hear me call it “tee-mem”) has a good number of different parts and jargon. If I bring up something that you didn’t hear me explain, or if I miss something that you’d like to hear about, feel free to speak up.
If after the presentation, you’d like to hear more, I’d encourage you to read this article... just google for “Transcendent Memory in a Nutshell”.
The overall objective of tmem is to utilize RAM more efficiently. There are a number of possible benefits from that, and we’ll talk about those a bit more.
Many virtualization users have consolidated their data centers, but find that their CPUs are still spending a lot of time idle. Sometimes this is because the real bottleneck is that there isn’t enough RAM in their systems. One solution is to add more RAM to all of their systems but that can be very expensive and we’d like to first ensure that the memory we do have is being efficiently utilized, not wasted. But it's often not easy to recognize the symptoms of inefficiently utilized memory in a bare metal OS, and it's even harder in a virtualized system. Some of you may call this “memory overcommit” and transcendent memory is one way Oracle products may support “memory overcommit”.
If that problem weren't challenging enough, we are starting to see the ratio of memory per core go down over time. This graph shows that, every two years, the average core, even as its throughput increases, can be expected to have about 30% less memory attached to it.
and with power consumption becoming more relevant to all of us, we see the percentage of energy in the data center that's used only for powering memory becoming larger.
and we are starting to see new kinds of memory, kinda like RAM, but with some idiosyncrasies.
and we are also starting to see new architectures with memory fitting in to a system differently than it has in the past. But in the context of this rapidly changing future memory environment, we carry forward with us a very old problem. (ALERT: PIG COMING!)
and that is that OS’s are memory hogs. Why? Most OS’s were written many years ago when memory was a scarce and expensive resource and every bit of memory had to be put to what the OS thinks is a good use. So as a result,
if you give an OS more memory
it’s going to essentially grow fat and use up whatever memory you give it. So it's not very easy to tell if an OS needs more memory or not and similarly it’s not very easy to tell whether it's using the memory it does have efficiently or not. And in a virtualized environment, this creates a real challenge. So, as a first step, it sounds like we need to put those guest OS's on a diet. Which is something I call:
memory asceticism. We assume that we'd like an OS not to use up every bit of memory available, but only what it needs. To do that, we need some kind of mechanism for an OS to donate memory to a bigger cause, and a way for an OS to get back some memory when it needs it. But how much memory does an OS "need"? We'll get back to that question in a few minutes, but first let's cover a little more background on one way this can be done.
Assume you have a normal computer system with a certain amount of RAM.
We're going to take that RAM and split it into two parts.
And, for now, we're going to call the two parts Type A memory and Type B memory.
To visually represent Type B memory we are going to place a curtain in front of it. This curtain can slide back and forth, meaning the amount of Type A memory -- memory not behind the curtain – may change when the curtain moves.
Now you can see -- and measure -- how much Type A memory there is... you know its capacity, and you know how to enumerate the addresses so you know how to read and write to any byte in Type A memory.
BUT although you knew how much total memory was in the system, and you know how much Type A memory there is, and although you surely know how to do a simple subtraction, I'd like you to NOT assume you know how much Type B memory there is. Assume the amount of Type B memory is completely unknowable. It might be zero, or it might be a gazillion bytes. You just don't know. And even if you could know how much Type B memory there is right this moment, it might change in the next moment. It's all very dynamic.
Since you do know how much Type A memory there is, let's just call that normal memory, or RAM. The OS kernel can decide how to make use of it just like normal. Some of it is used for the kernel itself, some of it to run applications, some for device DMA, etc etc. And the OS kernel decides what every byte is used for, can access any byte directly, and it has complete control over that memory, meaning it can change its mind about how any byte of memory is used whenever it wants. So this is just normal RAM for a normal OS kernel, right?
What about this Type B memory? Since you don't know how much there is, obviously you can NOT directly read and write to it using normal processor instructions. For example, if you want to write to byte number one-billion, how do you even know if there are a billion bytes? Instead, we are going to have an interface between the kernel and Type B memory where the kernel needs to ask "permission" and follow certain rules to read and write to Type B memory. Even the all-knowing, all-powerful kernel has to follow these rules. So what are those rules?
First, Type B memory can only be read or written a page at a time. A page is usually 4K bytes, but we can be flexible and decide on another page size as long as we are consistently using the same page size. Next, when the kernel wants to write to Type B memory, the kernel must use a special interface that we will call a "put page" call.
OK, so we have a page full of data in RAM and the kernel wants to see if it can "put" that page to Type B memory, behind the curtain. Let's call the data in that page ”Tux".
If the kernel wants to "put" a page full of data to Type B memory, it’s important to note that the kernel can be told NO. Kernels have big egos and don’t like it when they are told no, so we have to train them to be more well-mannered and gracious by using the defined “put page” call. Anyway, the kernel has two options.
First option is pretty normal: The kernel says "Here's a page of data called Tux... Mr Type B memory, can you take Tux? BUT if you say yes, I KNOW I'm going to need to get Tux back later, so you'd better keep him around. You can do whatever you want with him, BUT if I ask for him back, you'd damn well better give him back to me. BUT, one exception, if I reboot, you can throw him away. So, can you take him?"... and a reminder that the kernel is asking permission and Type B memory may say no.
Or the kernel can say: "Here's a page of data called Tux... Mr Type B memory, can you take Tux for me and squirrel him away someplace? I may ask for him back later, or I may not. And if you have room for him now, and then you need to throw him away later, that's fine too." So for this kind of "put", the kernel has to accept that there is some probability that it might get the page of data back if it asks for it, and some probability that the data might completely disappear... So in the first of the two choices, the probability that the kernel might get the data back is 100%, and in the second case, the probability is less than 100%. It may be a lot less than 100%, we just don't know, because it's all very dynamic.
OK, although Tux can be very entertaining, let's take a step back for a moment and give these ideas some names. First, instead of "Type B memory", we are going to use the term "Transcendent Memory", or "tmem" for short. The word “transcendent” means "beyond the senses" and, by definition, Type B memory is beyond the senses of the kernel because, well, the kernel can't enumerate it and can't address it like it addresses normal memory; it can only move a page of data at a time. And the kernel has to overcome its ego and ask for permission.
The two types of "puts", we are going to call "persistent" and "ephemeral". The kind of "put" where we know we can definitely get Tux back, 100% of the time, we are going to call a "persistent put". And the kind of "put" where we don't care if we get Tux back, where the probability is less than 100%, we are going to call an "ephemeral put".
And when we ask Transcendent Memory for that page of data back, we are going to call that operation a "get." And if the kernel knows it isn’t going to need that page of data anymore and wants to tell tmem to throw it away, we will call that a flush. How does the kernel identify the page of data that it wants to put, get, or flush?
Well, for normal RAM there is a thing called a "physical address", which you can use to access any byte of RAM. And the processor has a large fancy virtual address space that it can use. You can't do that with Transcendent Memory.
For transcendent memory, for puts and gets, we need to provide a new kind of addressing, that we call a "handle", which is kind of an object-oriented name for a page. Within certain constraints, the OS kernel gets to decide what "handle" to use when "put"-ing the page of data and then uses that same handle when it wants to "get" it.
One example: Maybe the kernel is running as a guest, a virtual machine, and the page of data goes to special memory owned and managed by the hypervisor. This is actually where the concept of Transcendent Memory began over four years ago; the host-side has been implemented in Xen for three years and the guest-side works today in Oracle's Unbreakable Enterprise Kernel.
So virtualization is one example. Another example, we could compress Tux. Since most data compresses by about a factor of two, that could potentially save a lot of RAM. This functionality is fully working today, is called "zcache" and has been merged in the upstream Linux kernel tree for about a year and a half. With zcache, Type A memory is normal addressable kernel memory and Type B memory consists entirely of compressed pages.
Or... we could send Tux to a completely different place as long as we can get him back if and when we need to. Maybe that place is some underutilized RAM on a completely different machine.
That's a feature called RAMster, which went into Linux earlier this year. Rather than a guest and a hypervisor, we view multiple physical machines in a cluster as peers and allow them to work together to dynamically load balance their memory demand. In this cluster, if one machine is overloaded and another is basically idle, the idle machine’s RAM can be used to store pages of data for the overloaded machine. Kind of a poor man’s virtualization. This actually works pretty well over any protocol that supports kernel sockets, even a 100Mbit Ethernet connection.
Or maybe it's some solid state device that we are using not as an I/O device but as a RAM extension. As you may know, solid state devices, or SSD's, are getting very fast, almost as fast as RAM, but they have a number of idiosyncrasies that make it difficult for them to be used instead of RAM. It turns out that the rules of Transcendent Memory might be a good way to work around those idiosyncrasies. Or maybe we can combine the last two ideas...
maybe that "different place" is some solid state device on another machine, that serves as a shared RAM extension for any and all of a set of blades in a cabinet, depending on what blade at any given time is short on memory. These other ideas are in the early stages of exploration.
So this may seem like a lot of cool stuff, but doesn’t it require massive changes to the kernel? The answer, fortunately, is NO. The concept of an ephemeral put is a really good match for something the kernel does all the time, namely to “evict” clean page cache pages. A fairly simple, non-invasive patch to Linux called the cleancache patchset allows the kernel to use transcendent memory for these types of pages.
Similarly, something called “anonymous” pages represent the important data of running applications, and when the kernel is running short on memory, it starts swapping these anonymous pages out. And swap pages happen to be a really good match for transcendent memory’s “persistent” puts. We called this the frontswap patchset and that too was very clean and non-invasive.
Since cleancache and frontswap feed pages to transcendent memory, we call them “frontends”. It’s no coincidence that page cache pages and anonymous pages constitute the vast majority of pages the kernel manages in a running system, so the frontends can pass a lot of pages to tmem. It’s also no coincidence that cleancache and frontswap interface cleanly to any of zcache, RAMster, Xen, or future transcendent memory implementations, which we call tmem “backends”.