Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Transcendent Memory
Avi Miller, Principal Program Manager, Oracle
Further reading: https://lwn.net/Articles/454795/
Objectives
 Utilise RAM more effectively
         – Lower capital costs
         – Lower power utilisation
         – Less I/O
 Better performance on many workloads
         – Negligible loss on others
Motivation: Memory-inefficient workloads
More motivation: memory capacity wall
[Chart: GB DRAM per core, 2003–2017, log scale]
Memory capacity per core drops ~30% every 2 years
Source: Disaggregated Memory for Expansion and Sharing in Blade Servers, http://isca09.cs.columbia.edu/pres/24.pptx
Slide from: Linux kernel support to exploit phase change memory, Linux Symposium 2010, Youngwoo Park, EE KAIST
Disaggregated memory
[Diagram: CPU-and-DIMM blades connected via an exofabric to a shared memory blade]
Leverage fast, shared communication fabrics
Source: Disaggregated Memory for Expansion and Sharing in Blade Servers, http://isca09.cs.columbia.edu/pres/24.pptx
OS memory “demand”
Operating systems are memory hogs!
OS Physical Memory Management
If you give an operating system more memory… (new, larger memory constraint)
OS Physical Memory Management
“My name is Linux and I am a… memory hog!” It uses up any memory you give it!
OS Memory “Asceticism”
 ASSUME
         – We should use as little RAM as possible
 SUPPOSE
         – Mechanism to allow the OS to surrender RAM
         – Mechanism to allow the OS to obtain more RAM
 THEN
         – How does an OS decide how much RAM it actually needs?
as-cet-i-cism, n. 1. extreme self-denial and austerity; rigorous self-discipline and active restraint; renunciation of material comforts so as to achieve a higher state
Impact on Linux Memory Subsystem
CAPACITY KNOWN: can read or write to any byte.
CAPACITY UNKNOWN, and may change dynamically!
Type A memory:
• CAPACITY: known
• USES: kernel memory, user memory, DMA
• ADDRESSABILITY: read/write any byte

Type B memory:
• CAPACITY: “unknowable” and dynamic
  SO… the kernel/CPU can’t address it directly!
  SO… need “permission” to access it, and need to “follow rules” (even the kernel!)
• THE RULES:
  1. “page”-at-a-time
  2. To put data here, the kernel MUST use a “put page” call
  3. (more rules later)
We have a page that contains: Tux.
The kernel wants to “preserve” Tux in Type B memory… but the kernel MUST ask permission, and Type B may say NO!
Two choices…
1. DEFINITELY want Tux back (e.g. “dirty” pages): a “PERSISTENT PUT”. Type B may commit to keeping the page around.
2. PROBABLY want Tux back, but OK if it disappears (e.g. “clean” pages): an “EPHEMERAL PUT”. Type B may keep the page around… or may not!
tran-scend-ent, adj., … beyond the range of normal perception
eph-em-er-al, adj., … transitory, existing only briefly, short-lived (i.e. NOT persistent)
Core Transcendent Memory Operations: “PUT”, “GET”, “FLUSH”
“Normal” RAM addressing:
• byte-addressable
• virtual address: @fffff80001024580

Transcendent Memory addressing:
• object-oriented
• the object is a page; a “handle” addresses a page
• kernel can (mostly) choose the handle when a page is put
• uses the same handle to get the page back
• must ensure the handle is, and remains, unique
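The put/get/flush interface and handle addressing can be sketched in C. This is an illustrative toy only: the names (`tmem_handle`, `tmem_put`, …) are hypothetical, not the kernel API, and the store is a tiny fixed-size array. But it shows the contract: a put may be refused, an ephemeral page may vanish to make room, and the same unique handle is used to get a page back.

```c
/* Toy "Type B" store: hypothetical sketch of tmem's put/get/flush
 * contract and handle addressing. NOT the real kernel interface. */
#include <string.h>

#define PAGE_SIZE 4096
#define TMEM_SLOTS 4            /* tiny, "unknowable" capacity */

struct tmem_handle {            /* object-oriented "address" of a page */
    unsigned pool_id;
    unsigned long object_id;
    unsigned index;
};

struct tmem_slot {
    int used;
    int persistent;             /* persistent vs ephemeral put */
    struct tmem_handle h;
    unsigned char data[PAGE_SIZE];
};

static struct tmem_slot slots[TMEM_SLOTS];

static int same_handle(const struct tmem_handle *a,
                       const struct tmem_handle *b)
{
    return a->pool_id == b->pool_id && a->object_id == b->object_id &&
           a->index == b->index;
}

/* "put": the kernel asks permission; the store may say NO (-1). */
int tmem_put(const struct tmem_handle *h, const void *page, int persistent)
{
    int i, victim = -1;
    for (i = 0; i < TMEM_SLOTS; i++) {
        if (!slots[i].used) { victim = i; break; }
        if (!slots[i].persistent)
            victim = i;         /* ephemeral pages may be evicted */
    }
    if (victim < 0)
        return -1;              /* full of persistent pages: say NO */
    slots[victim].used = 1;
    slots[victim].persistent = persistent;
    slots[victim].h = *h;
    memcpy(slots[victim].data, page, PAGE_SIZE);
    return 0;
}

/* "get": persistent pages still present always succeed; ephemeral
 * pages may have silently disappeared, so callers must handle -1. */
int tmem_get(const struct tmem_handle *h, void *page)
{
    int i;
    for (i = 0; i < TMEM_SLOTS; i++)
        if (slots[i].used && same_handle(&slots[i].h, h)) {
            memcpy(page, slots[i].data, PAGE_SIZE);
            return 0;
        }
    return -1;
}

/* "flush": the kernel promises never to ask for this page again. */
void tmem_flush(const struct tmem_handle *h)
{
    int i;
    for (i = 0; i < TMEM_SLOTS; i++)
        if (slots[i].used && same_handle(&slots[i].h, h))
            slots[i].used = 0;
}
```

Real backends (zcache, RAMster, Xen tmem) implement this same contract against compressed RAM, remote RAM, or hypervisor-owned memory.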
Why bother?
Once we’re behind the curtain, we can do interesting things…
Interesting thing #1
Virtual machines (aka “guests”) put pages into RAM owned by the hypervisor (aka the “host”)
– Tmem support: multiple guests, compression, deduplication
– Tmem supported in Xen since 4.0 (2009); future?
Interesting thing #2
– compress on put, decompress on get
– Zcache (2.6.39 staging driver)
Interesting thing #3
Transparently move pre-compressed pages across a high-speed coherent interconnect
Interesting thing #3
RAMster: peer-to-peer transcendent memory
Interesting thing #4
SSmem: Transcendent Memory as a “safe” access layer for SSD or NVRAM, e.g. as a “RAM extension”, not an I/O device
Interesting thing #5
…maybe only one large memory server, shared by many machines?
Cleancache (merged in Linux 3.0)
 A third-level victim cache for otherwise-reclaimed clean page cache pages
         – Optionally load-balanced across multiple clients
 Cleancache patchset:
         – VFS hooks to put clean page cache pages, get them back, and maintain coherency
         – Per-filesystem opt-in hooks
         – Shim to zcache in 2.6.39
         – Shim to Xen tmem in 3.0
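The cleancache flow can be sketched as: a backend registers an ops table, the reclaim path offers clean pages with an ephemeral put, and the read path asks the backend before going to disk. The signatures below are hypothetical simplifications (the real kernel hooks also carry a pool id and a filesystem key); the one-slot “toy” backend exists only so the sketch is complete.

```c
/* Hypothetical, simplified cleancache-style frontend/backend split.
 * NOT the real kernel hook signatures. */
#include <string.h>

#define PAGE_SZ 4096

struct cleancache_ops {
    /* ephemeral put: the backend may silently drop the page later */
    void (*put_page)(unsigned long inode, unsigned long index,
                     const void *page);
    /* 0 = hit (page filled), -1 = miss */
    int  (*get_page)(unsigned long inode, unsigned long index, void *page);
};

static struct cleancache_ops *backend;   /* NULL until a backend registers */

void cleancache_register_ops(struct cleancache_ops *ops) { backend = ops; }

/* Called when a clean page cache page is about to be reclaimed. */
void cleancache_put_page(unsigned long inode, unsigned long index,
                         const void *page)
{
    if (backend)                         /* no backend: page just dies */
        backend->put_page(inode, index, page);
}

/* Called on the file-read path: one disk access saved per hit. */
int cleancache_get_page(unsigned long inode, unsigned long index, void *page)
{
    if (!backend)
        return -1;                       /* fall through to disk read */
    return backend->get_page(inode, index, page);
}

/* --- one-slot toy backend, just enough to demonstrate a hit --- */
static unsigned char slot[PAGE_SZ];
static unsigned long slot_inode, slot_index;
static int slot_full;

static void toy_put(unsigned long inode, unsigned long index,
                    const void *page)
{
    memcpy(slot, page, PAGE_SZ);
    slot_inode = inode; slot_index = index; slot_full = 1;
}

static int toy_get(unsigned long inode, unsigned long index, void *page)
{
    if (!slot_full || slot_inode != inode || slot_index != index)
        return -1;
    memcpy(page, slot, PAGE_SZ);
    return 0;
}

static struct cleancache_ops toy_backend = { toy_put, toy_get };
```

Note how the frontend needs no knowledge of the backend: this is why zcache, RAMster, and Xen tmem can all sit behind the same hooks.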
Frontswap (merged in Linux 3.5)
 Temporary emergency FAST swap page store
         – Optionally load-balanced across multiple clients
 Frontswap patchset:
         – Swap subsystem hooks to put and get swap cache pages
         – Maintains coherency
         – Manages tracking data structures (1 bit/page)
         – Partial swapoff
         – Shim to zcache in 2.6.39
         – Shim to Xen tmem merged in 3.1
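The “1 bit/page” tracking structure mentioned above can be sketched as a bitmap: a bit is set when the backend accepted a put, so the swap-in path knows whether to ask the backend or read the swap device. Function names here (`frontswap_store`, `frontswap_load`) are illustrative, and the backend decision is passed in as a flag rather than calling a real backend.

```c
/* Hypothetical sketch of frontswap's 1-bit-per-swap-page bookkeeping. */
#include <limits.h>

#define SWAP_PAGES 1024
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static unsigned long fs_map[SWAP_PAGES / (sizeof(unsigned long) * CHAR_BIT)];

static void fs_set(unsigned off)
{ fs_map[off / BITS_PER_LONG] |=  1UL << (off % BITS_PER_LONG); }
static void fs_clear(unsigned off)
{ fs_map[off / BITS_PER_LONG] &= ~(1UL << (off % BITS_PER_LONG)); }
static int fs_test(unsigned off)
{ return (fs_map[off / BITS_PER_LONG] >> (off % BITS_PER_LONG)) & 1; }

/* Swap-out: backend_accepted reflects the backend's policy decision
 * (it may reject some, or all, pages). Returns -1 if the caller must
 * write to the real swap device instead. */
int frontswap_store(unsigned offset, int backend_accepted)
{
    if (backend_accepted) { fs_set(offset); return 0; }
    fs_clear(offset);
    return -1;
}

/* Swap-in: 0 means the page is in the backend ("get" it from RAM),
 * -1 means read from the swap disk as normal. */
int frontswap_load(unsigned offset)
{
    return fs_test(offset) ? 0 : -1;
}
```

This is why the core-kernel footprint stays small: the swap path only consults one bit per page before deciding where the data lives.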
Kernel changes
 Frontends require core kernel changes
         – Cleancache
         – Frontswap
 Backends do NOT require core kernel changes
         – Zcache, RAMster, and Xen tmem are all implemented as drivers
Transcendent Memory in Linux: a multi-year merge effort

Xen | non-Xen | Name of patchset                  | Linux version | Notes
 N  |   Y     | zcache/zcache2                    | 2.6.39/3.7    | staging driver
 Y  |   Y     | cleancache                        | 3.0           | Linus decided!
 Y  |   N     | Xen-tmem, selfballooning          | 3.1           |
 Y  |   ?     | frontswap-selfshrinking           | 3.1           |
 Y  |   Y     | frontswap                         | 3.5           | Linus decided!
 ?  |   Y     | RAMster (merged w/zcache2)        | 3.4/3.7       | staging driver
 Y  |   Y     | module support, frontswap unuse,  | 3.8?          | under development
    |         | frontswap admission improvements  |               |
Transcendent Memory: Oracle Product Plans
 Transcendent Memory now in the upstream Linux kernel
         – cleancache, frontswap
         – guest kernel support (aka Xen tmem)
         – zcache
         – RAMster
 Transcendent Memory support has been in the Xen hypervisor for over 2 years
         – Available in Oracle VM 2.2 and 3.x
 Transcendent Memory in UEK2 for a year
         – cleancache, frontswap
         – guest kernel support (aka Xen tmem)
         – zcache2 coming soon
Pretty Graphs! Facts! Figures!
frontswap patchset diffstat

 Documentation/vm/frontswap.txt | 210 +++++++++++++++++++
 include/linux/frontswap.h      | 126 ++++++++++++
 include/linux/swap.h           |   4
 include/linux/swapfile.h       |  13 +
 mm/Kconfig                     |  17 ++
 mm/Makefile                    |   1
 mm/frontswap.c                 | 273 +++++++++++++++++++
 mm/page_io.c                   |  12 +
 mm/swapfile.c                  |  64 +++++--
 9 files changed, 707 insertions(+), 13 deletions(-)

 Low core maintenance impact: ~100 lines
 No impact if CONFIG_FRONTSWAP=n
 Negligible impact if CONFIG_FRONTSWAP=y and no backend
 How much benefit per backend?
A benchmark
 Workload:
         – make -jN on linux-3.1 source (after make clean)
         – Fresh reboot before each run
         – All tests run as root in multi-user mode
 Software:
         – Linux 3.2
 Hardware:
         – Dell Optiplex 790 (~$500)
         – Intel Core i5-2400 @ 3.1GHz (quad-core/hyperthreaded, 6M cache)
         – 1GB DDR3 RAM @ 1333MHz (limited by memmap)
         – One 7200rpm SATA 6Gbps drive with 8MB cache
         – 10GB swap partition
         – 1Gb Ethernet
Workload objective: changing N varies memory pressure
 Small N (4-12): no memory pressure
         – Page cache never fills to exceed RAM; no swapping
 Medium N (16-24): moderate memory pressure
         – Page cache fills, so lots of reclaiming, but little to no swapping
 Large N (28-36): high memory pressure
         – Much page cache churn; lots of swapping
 Largest N (40): extreme memory pressure
         – Little space for page cache churn; swap storm occurs
Native/Baseline (no zcache registered)
Kernel compile “make -jN” elapsed time in seconds (smaller is better):

N:          4    8   12    16    20    24    28    32    36   40
no zcache 879  858  858  1009  1316  2164  3293  4286  6516  DNC (did not complete; 18000+ s)
Review: what is zcache?
Captures and compresses evicted clean page cache pages
 When clean pages are reclaimed (cleancache “put”):
         – zcache compresses/stores the contents of evicted pages in RAM
         – zcache has a “shrinker hook” for when the kernel runs low
 When the filesystem reads file pages (cleancache “get”):
         – zcache checks if it has a copy; if so, decompresses and returns it
         – else the page is read from the filesystem/disk as normal
 One disk access saved for every successful “get”
Review: what is zcache?
Captures and compresses swap pages (in RAM)
 When a page needs to be swapped out (frontswap “put”):
         – zcache compresses/stores the contents of the swap page in RAM
         – zcache enforces policies, and may reject some (or all) pages
         – frontswap maintains a bitmap of saved/rejected swap pages
 When a page needs to be swapped in (frontswap “get”):
         – if the frontswap bit is set, zcache decompresses and returns the page
         – else it is read from the swap disk as normal
 One disk write+read saved for every successful “get”
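The compress-on-put / decompress-on-get idea can be shown with a toy codec. The real zcache uses a proper compressor (LZO via the kernel crypto layer); the run-length encoder below is only a stand-in so the round trip is visible: a compressible page shrinks on “put” and comes back bit-identical on “get”.

```c
/* Toy run-length codec standing in for zcache's real compressor.
 * Encodes (count, byte) pairs; worst case doubles the size, so the
 * output buffer must be 2x the input. Illustrative only. */
#include <stddef.h>

size_t rle_compress(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0, i = 0;
    while (i < n) {
        unsigned char b = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == b && run < 255)
            run++;
        out[o++] = (unsigned char)run;   /* run length */
        out[o++] = b;                    /* repeated byte */
        i += run;
    }
    return o;                            /* compressed length */
}

size_t rle_decompress(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0, i;
    for (i = 0; i + 1 < n; i += 2) {
        size_t run = in[i];
        while (run--)
            out[o++] = in[i + 1];
    }
    return o;                            /* original length */
}
```

The point of the slide survives the toy: if a page compresses to roughly half its size on the “put” path, every two evicted or swapped pages cost only one page of RAM, and a successful “get” avoids a disk access entirely.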
Zcache vs native/baseline
[Chart: kernel compile “make -jN” elapsed time, zcache vs no zcache (smaller is better)]
Benchmark analysis: zcache
 Small N (4-12): no memory pressure
         – zcache has no effect, …
Review: what is RAMster?
Locally compresses swap and clean page cache pages, but stores them in remote RAM
 Leverages zcache, ad…
Zcache and RAMster
[Chart: kernel compile “make -jN” elapsed time, …]
Workload Analysis: RAMster
 Small N (4-12): no memory pressure
         – RAMster has no effect…
Questions?
Transcendent Memory: Not Just for Virtualization Anymore

RAM is "cheap". Or is it?

If a million machine data center could cut RAM in half, how much could be saved, in capital equipment cost and power/cooling expense?

Transcendent Memory (or "tmem") is a new approach for flexibly, dynamically, and efficiently managing physical memory. First conceived to facilitate the optimization of physical memory utilization among a set of guests in a virtualization environment (and implemented in Xen 4.0), tmem has now also been applied in the kernel to dynamically compress page cache and swap pages ("zcache"), and to dynamically hot-plug memory among a set of kernels ("RAMster").

And tmem may, in the future, allow more effective utilization of future memory-extension technologies. All this with very minimal changes to the kernel required.

  • Our agenda for today: I'm going to quickly review the motivation and the key problem, and identify the challenge of optimizing memory utilization in both virtualized and non-virtualized environments. Transcendent memory (or you may hear me call it “tee-mem”) has a good number of different parts and jargon. If I bring up something that you didn’t hear me explain, or if I miss something that you’d like to hear about, feel free to speak up.
  • If after the presentation, you’d like to hear more, I’d encourage you to read this article... just google for “Transcendent Memory in a Nutshell”.
  • The overall objective of tmem is to utilize RAM more efficiently. There’s a number of possible benefits from that and we’ll talk about these a bit more.
  • Many virtualization users have consolidated their data centers, but find that their CPUs are still spending a lot of time idle. Sometimes this is because the real bottleneck is that there isn’t enough RAM in their systems. One solution is to add more RAM to all of their systems but that can be very expensive and we’d like to first ensure that the memory we do have is being efficiently utilized, not wasted. But it's often not easy to recognize the symptoms of inefficiently utilized memory in a bare metal OS, and it's even harder in a virtualized system. Some of you may call this “memory overcommit” and transcendent memory is one way Oracle products may support “memory overcommit”.
  • If that problem weren't challenging enough, we are starting to see the ratio of memory per core go down over time. This graph shows that we can expect the average core, even as its throughput increases, to have about 30% less memory attached to it every two years.
  • and with power consumption becoming more relevant to all of us, we see the percentage of energy in the data center that's used only for powering memory becoming larger.
  • and we are starting to see new kinds of memory, kinda like RAM, but with some idiosyncrasies.
  • and we are also starting to see new architectures with memory fitting in to a system differently than it has in the past. But in the context of this rapidly changing future memory environment, we carry forward with us a very old problem. (ALERT: PIG COMING!)
  • and that is that OS’s are memory hogs. Why? Most OS’s were written many years ago when memory was a scarce and expensive resource and every bit of memory had to be put to what the OS thinks is a good use. So as a result,
  • if you give an OS more memory
  • it’s going to essentially grow fat and use up whatever memory you give it. So it's not very easy to tell if an OS needs more memory or not and similarly it’s not very easy to tell whether it's using the memory it does have efficiently or not. And in a virtualized environment, this creates a real challenge. So, as a first step, it sounds like we need to put those guest OS's on a diet. Which is something I call:
  • memory asceticism. We assume that we'd like an OS not to use up every bit of memory available, but only what it needs. To do that, we need some kind of mechanism for an OS to donate memory to a bigger cause, and a way for an OS to get back some memory when it needs it. But how much memory does an OS "need"? We'll get back to that question in a few minutes, but first let's cover a little more background on one way this can be done.
  • Assume you have a normal computer system with a certain amount of RAM.
  • We're going to take that RAM and split it into two parts.
  • And, for now, we're going to call the two parts Type A memory and Type B memory.
  • To visually represent Type B memory we are going to place a curtain in front of it. This curtain can slide back and forth, meaning the amount of Type A memory -- memory not behind the curtain – may change when the curtain moves.
  • Now you can see -- and measure -- how much Type A memory there is... you know its capacity, and you know how to enumerate the addresses so you know how to read and write to any byte in Type A memory.
  • BUT although you knew how much total memory was in the system, and you know how much Type A memory there is, and although you surely know how to do a simple subtraction, I'd like you to NOT assume you know how much Type B memory there is. Assume the amount of Type B memory is completely unknowable. It might be zero, or it might be a gazillion bytes. You just don't know. And even if you could know how much Type B memory there is right this moment, it might change in the next moment. It's all very dynamic.
  • Since you do know how much Type A memory there is, let's just call that normal memory, or RAM. The OS kernel can decide how to make use of it just like normal. Some of it is used for the kernel itself, some of it to run applications, some for device DMA, etc etc. And the OS kernel decides what every byte is used for, can access any byte directly, and it has complete control over that memory, meaning it can change its mind about how any byte of memory is used whenever it wants. So this is just normal RAM for a normal OS kernel, right?
  • What about this Type B memory? Since you don't know how much there is, obviously you can NOT directly read and write to it using normal processor instructions. For example, if you want to write to byte number one billion, how do you even know if there is a billion bytes? Instead, we are going to have an interface between the kernel and Type B memory where the kernel needs to ask "permission" and follow certain rules to read and write to Type B memory. Even the all-knowing, all-powerful kernel has to follow these rules. So what are those rules?
  • First, Type B memory can only be read or written a page at a time. A page is usually 4K bytes, but we can be flexible and decide on another page size as long as we are consistently using the same page size. Next, when the kernel wants to write to Type B memory, the kernel must use a special interface that we will call a "put page" call.
  • OK, so we have a page full of data in RAM and the kernel wants to see if it can "put" that page to Type B memory, behind the curtain. Let's call the data in that page ”Tux".
  • If the kernel wants to "put" a page full of data to Type B memory, it’s important to note that the kernel can be told NO. Kernels have big egos and don’t like it when they are told no, so we have to train them to be more well-mannered and gracious by using the defined “put page” call. Anyway, the kernel has two options.
  • First option is pretty normal: The kernel says "Here's a page of data called Tux... Mr Type B memory, can you take Tux? BUT if you say yes, I KNOW I'm going to need to get Tux back later, so you'd better keep him around. You can do whatever you want with him, BUT if I ask for him back, you'd damn well better give him back to me. BUT, one exception: if I reboot, you can throw him away. So, can you take him?"... and a reminder that the kernel is asking permission, and Type B memory may say no.
  • Or the kernel can say: "Here's a page of data called Tux... Mr Type B memory, can you take Tux for me and squirrel him away someplace? I may ask for him back later, or I may not. And if you have room for him now, and then you need to throw him away later, that's fine too." So for this kind of "put", the kernel has to accept that there is some probability that it might get the page of data back if it asks for it, and some probability that the data might completely disappear... So in the first of the two choices, the probability that the kernel might get the data back is 100%, and in the second case, the probability is less than 100%. It may be a lot less than 100%, we just don't know, because it's all very dynamic.
  • OK, although Tux can be very entertaining, let's take a step back for a moment and give these ideas some names. First, instead of "Type B memory", we are going to use the term "Transcendent Memory", or "tmem" for short. The word "transcendent" means "beyond the senses" and, by definition, Type B is beyond the senses of the kernel because, well, the kernel can't enumerate it and can't address it like it addresses normal memory, only a page of data at a time. And the kernel has to overcome its ego and ask for permission.
  • The two types of "puts" we are going to call "persistent" and "ephemeral". The kind of "put" where we know we can definitely get Tux back, 100% of the time, we are going to call a "persistent put". And the kind of "put" where we don't care if we get Tux back, where the probability is less than 100%, we are going to call an "ephemeral put".
  • And when we ask Transcendent Memory for that page of data back, we are going to call that operation a "get." And if the kernel knows it isn’t going to need that page of data anymore and wants to tell tmem to throw it away, we will call that a flush. How does the kernel identify the page of data that it wants to put, get, or flush?
  • Well, for normal RAM there is a thing called a "physical address", using which you can access any byte of RAM. And the processor has a large fancy virtual address space that it can use. Can't do that with Transcendent Memory.
  • For transcendent memory, for puts and gets, we need to provide a new kind of addressing, that we call a "handle", which is kind of an object-oriented name for a page. Within certain constraints, the OS kernel gets to decide what "handle" to use when "put"ing the page of data and then uses that same handle when it wants to "get" it.
  • One example: Maybe the kernel is running as a guest, a virtual machine, and that "different place" is special memory owned and managed by the hypervisor? This is actually where the concept of Transcendent Memory began over four years ago; the host-side has been implemented in Xen for three years, and the guest-side works today in Oracle's Unbreakable Enterprise Kernel.
  • So virtualization is one example. Another example, we could compress Tux. Since most data compresses by about a factor of two, that could potentially save a lot of RAM. This functionality is fully working today, is called "zcache" and has been merged in the upstream Linux kernel tree for about a year and a half. With zcache, Type A memory is normal addressable kernel memory and Type B memory consists entirely of compressed pages.
  • Or... we could send Tux to a completely different place as long as we can get him back if and when we need to. Maybe that place is some underutilized RAM on a completely different machine.
  • That's a feature called RAMster, which went into Linux earlier this year. Rather than a guest and a hypervisor, we view multiple physical machines in a cluster as peers and allow them to work together to dynamically load balance their memory demand. In this cluster, if one machine is overloaded and another is basically idle, the idle machine's RAM can be used to store pages of data for the overloaded machine. Kind of a poor man's virtualization. This actually works pretty well over any protocol that supports kernel sockets, even a 100Mbit Ethernet connection.
  • Or maybe it's some solid state device that we are using not as an I/O device but as a RAM extension. As you may know, solid state devices, or SSDs, are getting very fast, almost as fast as RAM, but they have a number of idiosyncrasies that make it difficult for them to be used instead of RAM. It turns out that the rules of Transcendent Memory might be a good way to work around those idiosyncrasies. Or maybe we can combine the last two ideas...
  • maybe that "different place" is some solid state device on another machine, that serves as a shared RAM extension for any and all of a set of blades in a cabinet, depending on what blade at any given time is short on memory. These other ideas are in the early stages of exploration.
  • So this may seem like a lot of cool stuff, but doesn't it require massive changes to the kernel? The answer, fortunately, is NO. The concept of an ephemeral put is a really good match for something the kernel does all the time, namely "evicting" clean page cache pages. A fairly simple, non-invasive patch to Linux called the cleancache patchset allows the kernel to use transcendent memory for these types of pages.
  • Similarly, something called "anonymous" pages represent the important data of running applications, and when the kernel is running short on memory, it starts swapping these anonymous pages out. Swap pages happen to be a really good match for transcendent memory's "persistent" pages. We called this the frontswap patchset, and that too was very clean and non-invasive.
  • Since cleancache and frontswap feed pages to transcendent memory, we call them "frontends". It's no coincidence that page cache pages and anonymous pages constitute the vast majority of pages the kernel manages in a running system, so the frontends can pass a lot of pages to tmem. It's also no coincidence that cleancache and frontswap interface cleanly to any of zcache, RAMster, Xen, or future transcendent memory implementations, which we call tmem "backends".

    1. 1. 1 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    2. 2. Transcendent MemoryAvi MillerPrincipal Program Manager ORACLEPRODUCT LOGO
    3. 3. Further reading:https://lwn.net/Articles/454795/3 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    4. 4. Objectives Utilise RAM more effectively – Lower capital costs – Lower power utilisation – Less I/O Better performance on many workloads – Negligible loss on others4 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    5. 5. Motivation: Memory-inefficient workloads5 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    6. 6. More motivation: memory capacity wall 1000 # Core GB DRAM 100 10 1 2003 2004 2005 2006 2007 2008 2009 2010 2012 2013 2014 2015 2016 2017 Memory capacity per core drops ~30% every 2 years 2011 Source: Disaggregated Memory for Expansion and Sharing in Blade Server http://isca09.cs.columbia.edu/pres/24.pptx6 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    7. 7. 7 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    8. 8. Slide from: Linux kernel support to exploit phase change memory, Linux Symposium 2010, Youngwoo Park, EE KAIST8 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    9. 9. Disaggregated memory DIMM DIMM DIMM DIMM DIMM CPUs CPUs DIMM DIMM DIMM Exofabric DIMM DIMM DIMM DIMM DIMM CPUs CPUs DIMM DIMM DIMM Leverage fast, shared Memory communication fabrics blade Source: Disaggregated Memory for Expansion and Sharing in Blade Server http://isca09.cs.columbia.edu/pres/24.pptx9 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    10. 10. OS memory “demand” OSOperatingsystems arememory hogs! Memory constraint 10 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    11. 11. OS Physical Memory Management OSIf you give anoperatingsystem morememory… New larger memory constraint 11 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    12. 12. OS Physical Memory Management My name is Linux and I am a… it uses up memory hogany memoryyou give it! Memory constraint 12 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    13. 13. OS Memory “Asceticism”  ASSUME – We should use as little RAM as possible  SUPPOSE – Mechanism to allow the OS to surrender RAM – Mechanism to allow the OS to obtain more RAM  THEN – How does an OS decide how much RAM it actually needs? as-cet-i-cism, n. 1. extreme self-denial and austerity; rigorous self-discipline and active restraint; renunciation of material comforts so as to achieve a higher state13 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    14. 14. Impact on Linux Memory Subsystem14 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    15. 15. 15 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    16. 16. 16 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    17. 17. 17 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    18. 18. CAPACITY KNOWN Can read or write to any byte.18 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    19. 19. CAPACITY UNKOWN CAPACITY KNOWN and may change Can read or write to dynamically! any byte.19 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    20. 20. • CAPACITY: known • USES: • kernel memory • user memory • DMA • ADDRESSABILITY: • Read/write any byte20 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    21. 21. • CAPACITY: known • USES: • CAPACITY -“unknowable” • kernel memory - dynamic SO… • user memory kernel/CPU can’t • DMA SO… address directly! • ADDRESSABILITY: Need “permission” to access and need • Read/write any byte to “follow rules” (even the kernel!)21 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    22. 22. • CAPACITY: known • USES: • THE RULES • kernel memory 1. “page”-at-a-time • user memory 2. to put data here, • DMA kernel MUST use a • ADDRESSABILITY: “put page call” • Read/write any byte 3. (more rules later)22 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    23. 23. 23 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    24. 24. We have a page that contains: And the kernel wants to “preserve” Tux in Type B memory.24 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    25. 25. We have a page that contains: may say NO to kernel! And the kernel wants to “preserve” Tux into Type B memory… but… Kernel MUST ask permission and may get told NO!25 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    26. 26. We have a page that contains: may say NO to kernel!And the kernel wants to “preserve”Tux into Type B memory. may commit toTwo choices… keeping the1.DEFINITELY want Tux back page around…(e.g. “dirty” page)26 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    27. 27. We have a page that contains: may say NO to kernel! And the kernel wants to “preserve” Tux into Type B memory. Two choices… may commit 1.DEFINITELY want Tux back to keeping the 2.PROBABLY want Tux back page around… (but OK if disappears, e.g. “clean” pages) or may not!27 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    28. 28. We have a page that contains: Two choices… 1.DEFINITELY want Tux back 2.PROBABLY want Tux back tran-scend-ent, adj., … beyond the range of normal perception28 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    29. 29. We have a page that contains:Two choices…1.DEFINITELY want Tux back“PERSISTENT PUT”2.PROBABLY want Tux back“EPHEMERAL PUT” eph-em-er-al, adj., … transitory, existing only briefly, short- tran-scend-ent, adj., … beyond the lived (i.e. NOT persistent) range of normal perception29 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    30. 30. “PUT” “GET” “FLUSH” Core Transcendent Memory Operations30 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    31. 31. “Normal” RAM addressing • byte-addressable • virtual address: @fffff8000102458 031 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    32. 32. “Normal” RAM Transcendentaddressing Memory• byte-addressable • object-oriented addressing• virtual address: • object is a page • handle addresses a page@fffff80001024580 • kernel can (mostly) choose handle when a page is put • uses same handle to get • must ensure handle is and remains unique32 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    33. 33. Why bother?33 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    34. 34. Once we’re behind thecurtain, we can dointeresting things…34 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    35. 35. Interesting thing #1 virtual machines (aka “guests”) hypervisor (aka “host”) hypervisor Tmem support: Tmem supported in RAM • multiple guests Xen since 4.0 (2009) • compression • deduplication future?35 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    36. 36. Interesting thing #2 compress on put decompress on get Zcache (2.6.39 staging driver)36 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    37. 37. Interesting thing #3 Transparently move pre- compressed pages cross a high-speed coherent interconnect37 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    38. 38. Interesting thing #3 RAMster Peer-to-peer transcendent memory38 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    39. 39. Interesting thing #4SSmem: Transcendent Memory as a“safe” access layer for SSD or NVRAMe.g. as a “RAM extension” not I/O device 39 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    40. 40. Interesting thing #3 …maybe only one large memory server shared by many machines?40 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    41. 41. CleancacheMerged in Linux 3.0  A third-level victim cache for otherwise reclaimed clean page cache pages – Optionally load-balanced across multiple clients  Cleancache patchset: – VFS hooks to put clean page cache pages, get them back, maintain coherency – Per filesystem opt-in hooks – Shim to zcache in 2.6.39 – Shim to Xen tmem in 3.041 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    42. 42. FrontswapMerged in Linux 3.5  Temporary emergency FAST swap page store – Optionally load-balanced across multiple clients  Frontswap patchset: – Swap subsystem hooks to put and get swap cache pages – Maintain coherency – Manages tracking data structures (1 bit/page) – Partial swapoff – Shim to zcache in 2.6.39 – Shim to Xen tmem merged in 3.142 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    43. 43. Kernel changes  Frontends require core kernel changes – Cleancache – Frontswap  Backends do NOT require core kernel chances – Zcache, RAMster, Xen tmem all implemented as drivers43 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    44. 44. Transcendent Memory in LinuxMulti-year merge effortXen non- name of patchset Linux Xen versionN Y zcache/zcache2 2.6.39/3.7 staging driverY Y cleancache 3.0 Linus decided!Y N Xen-tmem, selfballooning 3.1Y ? frontswap-selfshrinking 3.1Y Y Frontswap 3.5 Linus decided!? Y RAMster (merged w/zcache2) 3.4/3.7 staging driverY Y module support, frontswap unuse, 3.8? under development frontswap admission improvements44 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    45. 45. Transcendent MemoryOracle Product Plans  Transcendent Memory now in upstream Linux kernel – cleancache, frontswap – guest kernel support (aka Xen tmem) – zcache – RAMster  Transcendent Memory support has been in the Xen hypervisor for over 2 years. – Available in Oracle VM 2.2 and 3.x  Transcendent Memory in UEK2 for a year – cleancache, frontswap – guest kernel support (aka Xen tmem) – zcache2 coming soon45 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    46. 46. Pretty Graphs! Facts! Figures!46 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
    47. 47. frontswap patchset diffstatDocumentation/vm/frontswap.txt | 210 +++++++++++++++++++include/linux/frontswap.h | 126 ++++++++++++include/linux/swap.h | 4include/linux/swapfile.h | 13 +mm/Kconfig | 17 ++mm/Makefile | 1mm/frontswap.c | 273 +++++++++++++++++++mm/page_io.c | 12 +mm/swapfile.c | 64 +++++--9 files changed, 707 insertions(+), 13 deletions(-)  Low core maintenance impact – ~100 lines  No impact if CONFIG_FRONTSWAP=n  Negligible impact if CONFIG_FRONTSWAP=y and no backend  How much benefit per backend?47 Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
48. A benchmark
- Workload:
  – make -jN on linux-3.1 source (after make clean)
  – Fresh reboot before each run
  – All tests run as root in multi-user mode
- Software:
  – Linux 3.2
- Hardware:
  – Dell Optiplex 790 (~$500)
  – Intel Core i5-2400 @ 3.1 GHz (quad-core, hyperthreaded, 6 MB cache)
  – 1 GB DDR3 RAM @ 1333 MHz (limited by memmap)
  – One 7200 rpm SATA 6 Gbps drive with 8 MB cache
  – 10 GB swap partition
  – 1 Gb Ethernet
49. Workload objective
Changing N varies memory pressure
- Small N (4-12): no memory pressure
  – Page cache never fills to exceed RAM, no swapping
- Medium N (16-24): moderate memory pressure
  – Page cache fills, so lots of reclaiming, but little to no swapping
- Large N (28-36): high memory pressure
  – Much page cache churn, lots of swapping
- Largest N (40): extreme memory pressure
  – Little space for page cache churn, swap storm occurs
50. Native/Baseline (no zcache registered)
Kernel compile "make -jN", elapsed time in seconds (smaller is better):

  N          4    8    12   16    20    24    28    32    36    40
  no zcache  879  858  858  1009  1316  2164  3293  4286  6516  DNC (18000+)

DNC = did not complete
51. Review: what is zcache?
Captures and compresses evicted clean page cache pages
- When clean pages are reclaimed (cleancache "put"):
  – zcache compresses/stores contents of evicted pages in RAM
  – zcache has a "shrinker hook" for when the kernel runs low on memory
- When the filesystem reads file pages (cleancache "get"):
  – zcache checks if it has a copy; if so, it decompresses/returns the page
  – else the page is read from the filesystem/disk as normal
- One disk access saved for every successful "get"
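The cleancache side of zcache can be sketched the same way. This is a minimal user-space model in Python with invented names; the key property it illustrates is that cleancache is ephemeral: the shrinker may drop compressed pages at any time, so a "get" may legitimately miss even after a "put":

```python
import zlib

class CleancacheSketch:
    """Toy model of zcache's cleancache side: evicted *clean* page-cache
    pages are compressed into RAM. The cache is ephemeral -- a shrinker
    may drop entries at any time, so get() after put() may still miss."""

    def __init__(self):
        # Keyed like cleancache: (pool/filesystem, inode, page index)
        self.pool = {}

    def put(self, key, page):
        """Called when a clean page is reclaimed from the page cache."""
        self.pool[key] = zlib.compress(page)

    def get(self, key):
        """Called on a file read. A hit saves one disk access."""
        data = self.pool.get(key)
        return zlib.decompress(data) if data is not None else None

    def shrink(self, n):
        """The 'shrinker hook': kernel is low on RAM, drop up to n pages."""
        for key in list(self.pool)[:n]:
            del self.pool[key]
```

Because the pages are clean, dropping them in shrink() is always safe: an identical copy still exists on disk, and a subsequent miss simply falls back to the normal filesystem read path.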
52. Review: what is zcache?
Captures and compresses swap pages (in RAM)
- When a page needs to be swapped out (frontswap "put"):
  – zcache compresses/stores contents of the swap page in RAM
  – zcache enforces policies, may reject some (or all) pages
  – frontswap maintains a bitmap for saved/rejected swap pages
- When a page needs to be swapped in (frontswap "get"):
  – if the frontswap bit is set, zcache decompresses/returns the page
  – else the page is read from the swap disk as normal
- One disk write+read saved for every successful "get"
53. Zcache vs native/baseline
Kernel compile "make -jN", elapsed time in seconds (smaller is better):

  N          4    8    12   16    20    24    28    32    36    40
  no zcache  879  858  858  1009  1316  2164  3293  4286  6516  DNC
  zcache     877  856  856  922   1154  1714  2500  4282  6602  13755

Up to 26-31% faster
54. Benchmark analysis - zcache
- Small N (4-12): no memory pressure
  – zcache has no effect, but apparently no measurable cost either
- Medium N (16-20): moderate memory pressure
  – zcache increases total pages cached due to compression
  – performance improves 9%-14%
- Large N (24-28): high memory pressure
  – zcache increases total pages cached due to compression
  – AND zcache uses RAM for compressed swap to avoid swap-to-disk
  – performance improves 26%-31%
- Large N (32-36): very high memory pressure
  – compressed page cache gets reclaimed before use, no advantage
  – compressed in-RAM swap counteracted by smaller kernel page cache?
  – performance change ranges from 0% to a 1% loss
- Largest N (40): extreme memory pressure
  – in-RAM swap compression reduces the worst-case swap storm
55. Review: what is RAMster?
Locally compresses swap and clean page cache pages, but stores in remote RAM
- Leverages zcache, adds cluster code using kernel sockets
- Same as zcache, but also "remotifies" compressed swap pages to another system's RAM
  – One disk write+read saved for every successful swap "get" (at the cost of some network traffic)
  – One disk access saved for every successful page cache "get" (at the cost of some network traffic)
- Peer-to-peer or client-server (currently up to 8 nodes)
- RAM management is entirely dynamic
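The "remotify" step can be sketched by extending the same toy model: compress locally as zcache does, and when local space runs out, push the compressed page to a peer node's RAM instead of falling back to the swap disk. This is a hypothetical Python illustration (the real RAMster uses kernel sockets between nodes; `peer` here is just a stand-in dictionary for the remote store):

```python
import zlib

class RamsterSketch:
    """Toy model of RAMster: pages are compressed locally (as in zcache);
    when local space runs out they are 'remotified' to a peer node's RAM
    rather than being rejected to the swap disk."""

    def __init__(self, local_capacity, peer):
        self.local_capacity = local_capacity  # max pages kept in local RAM
        self.local = {}                       # slot -> compressed page (local)
        self.peer = peer                      # stands in for a remote node

    def put(self, slot, page):
        data = zlib.compress(page)
        if len(self.local) < self.local_capacity:
            self.local[slot] = data           # keep locally while space remains
        else:
            self.peer[slot] = data            # remotify: network cost, no disk I/O
        return True

    def get(self, slot):
        """Check local RAM first, then the peer; either hit avoids disk."""
        data = self.local.get(slot)
        if data is None:
            data = self.peer.get(slot)
        return zlib.decompress(data) if data is not None else None
```

This captures the trade-off the slide states: a remote hit costs a network round trip but still saves the disk write+read, which is why RAMster keeps helping at pressure levels where local-only zcache has run out of room.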
56. Zcache and RAMster
Kernel compile "make -jN", elapsed time in seconds (smaller is better):

  N          4    8    12   16    20    24    28    32    36    40
  no zcache  879  858  858  1009  1316  2164  3293  4286  6516  DNC
  zcache     877  856  856  922   1154  1714  2500  4282  6602  13755
  ramster    887  866  875  949   1162  1788  2177  3599  5394  8172
57. Workload analysis - RAMster
- Small N (4-12): no memory pressure
  – RAMster has no effect, but a small cost
- Medium N (16-20): moderate memory pressure
  – RAMster increases total pages cached due to compression
  – performance improves 6%-13% (somewhat slower than zcache)
- Large N (24-28): high memory pressure
  – RAMster increases total pages cached locally due to compression
  – and RAMster uses remote RAM to avoid swap-to-disk
  – performance improves 21%-51%
- Large N (32-36): very high memory pressure
  – compressed page cache gets reclaimed before use, no advantage
  – but RAMster still uses remote (compressed) RAM to avoid swap-to-disk
  – performance improves 19%-22% (vs zcache and native)
- Largest N (40): extreme memory pressure
  – use of remote RAM significantly reduces the worst-case swap storm
58. Questions?