RAM is "cheap". Or is it?
If a million-machine data center could cut RAM in half, how much could be saved in capital equipment cost and power/cooling expense?
Transcendent Memory (or "tmem") is a new approach for flexibly, dynamically, and efficiently managing physical memory. First conceived to facilitate the optimization of physical memory utilization among a set of guests in a virtualization environment (and implemented in Xen 4.0), tmem has now also been applied in the kernel to dynamically compress page cache and swap pages ("zcache"), and to dynamically hot-plug memory among a set of kernels ("RAMster").
And tmem may, in the future, allow more effective utilization of emerging memory-extension technologies. All this requires only minimal changes to the kernel.
Our agenda for today: I'm going to quickly review the motivation, identify the key problem, and lay out the challenge of optimizing memory utilization in both virtualized and non-virtualized environments. Transcendent memory (or you may hear me call it “tee-mem”) has a good number of different parts and jargon. If I bring up something that you didn’t hear me explain, or if I miss something that you’d like to hear about, feel free to speak up.
If after the presentation, you’d like to hear more, I’d encourage you to read this article... just google for “Transcendent Memory in a Nutshell”.
The overall objective of tmem is to utilize RAM more efficiently. There are a number of possible benefits from that, and we’ll talk about those a bit more.
Many virtualization users have consolidated their data centers, but find that their CPUs are still spending a lot of time idle. Sometimes this is because the real bottleneck is that there isn’t enough RAM in their systems. One solution is to add more RAM to all of their systems but that can be very expensive and we’d like to first ensure that the memory we do have is being efficiently utilized, not wasted. But it's often not easy to recognize the symptoms of inefficiently utilized memory in a bare metal OS, and it's even harder in a virtualized system. Some of you may call this “memory overcommit” and transcendent memory is one way Oracle products may support “memory overcommit”.
If that problem weren't challenging enough, we are starting to see the ratio of memory per core go down over time. This graph shows that, every two years, the average core, even as its throughput increases, can be expected to have about 30% less memory attached to it.
and with power consumption becoming more relevant to all of us, we see the percentage of energy in the data center that's used only for powering memory becoming larger.
and we are starting to see new kinds of memory, kinda like RAM, but with some idiosyncrasies.
and we are also starting to see new architectures with memory fitting in to a system differently than it has in the past. But in the context of this rapidly changing future memory environment, we carry forward with us a very old problem. (ALERT: PIG COMING!)
and that is that OS’s are memory hogs. Why? Most OS’s were written many years ago when memory was a scarce and expensive resource and every bit of memory had to be put to what the OS thinks is a good use. So as a result,
if you give an OS more memory
it’s going to essentially grow fat and use up whatever memory you give it. So it's not very easy to tell if an OS needs more memory or not and similarly it’s not very easy to tell whether it's using the memory it does have efficiently or not. And in a virtualized environment, this creates a real challenge. So, as a first step, it sounds like we need to put those guest OS's on a diet. Which is something I call:
memory asceticism. We assume that we'd like an OS not to use up every bit of memory available, but only what it needs. To do that, we need some kind of mechanism for an OS to donate memory to a bigger cause, and a way for an OS to get back some memory when it needs it. But how much memory does an OS "need"? We'll get back to that question in a few minutes, but first let's cover a little more background on one way this can be done.
Assume you have a normal computer system with a certain amount of RAM.
We're going to take that RAM and split it into two parts.
And, for now, we're going to call the two parts Type A memory and Type B memory.
To visually represent Type B memory we are going to place a curtain in front of it. This curtain can slide back and forth, meaning the amount of Type A memory -- memory not behind the curtain – may change when the curtain moves.
Now you can see -- and measure -- how much Type A memory there is... you know its capacity, and you know how to enumerate the addresses so you know how to read and write to any byte in Type A memory.
BUT although you knew how much total memory was in the system, and you know how much Type A memory there is, and although you surely know how to do a simple subtraction, I'd like you to NOT assume you know how much Type B memory there is. Assume the amount of Type B memory is completely unknowable. It might be zero, or it might be a gazillion bytes. You just don't know. And even if you could know how much Type B memory there is right this moment, it might change in the next moment. It's all very dynamic.
Since you do know how much Type A memory there is, let's just call that normal memory, or RAM. The OS kernel can decide how to make use of it just like normal. Some of it is used for the kernel itself, some of it to run applications, some for device DMA, etc etc. And the OS kernel decides what every byte is used for, can access any byte directly, and it has complete control over that memory, meaning it can change its mind about how any byte of memory is used whenever it wants. So this is just normal RAM for a normal OS kernel, right?
What about this Type B memory? Since you don't know how much there is, obviously you can NOT directly read and write to it using normal processor instructions. For example, if you want to write to byte number one-billion, how do you even know if there are a billion bytes? Instead, we are going to have an interface between the kernel and Type B memory where the kernel needs to ask "permission" and follow certain rules to read and write to Type B memory. Even the all-knowing, all-powerful kernel has to follow these rules. So what are those rules?
First, Type B memory can only be read or written a page at a time. A page is usually 4K bytes, but we can be flexible and decide on another page size as long as we are consistently using the same page size. Next, when the kernel wants to write to Type B memory, the kernel must use a special interface that we will call a "put page" call.
OK, so we have a page full of data in RAM and the kernel wants to see if it can "put" that page to Type B memory, behind the curtain. Let's call the data in that page ”Tux".
If the kernel wants to "put" a page full of data to Type B memory, it’s important to note that the kernel can be told NO. Kernels have big egos and don’t like it when they are told no, so we have to train them to be more well-mannered and gracious by using the defined “put page” call. Anyway, the kernel has two options.
First option is pretty normal: The kernel says "Here's a page of data called Tux... Mr Type B memory, can you take Tux? BUT if you say yes, I KNOW I'm going to need to get Tux back later, so you'd better keep him around. You can do whatever you want with him, BUT if I ask for him back, you'd damn well better give him back to me. BUT, one exception, if I reboot, you can throw him away. So, can you take him?"... and a reminder that the kernel is asking permission and Type B memory may say no.
Or the kernel can say: "Here's a page of data called Tux... Mr Type B memory, can you take Tux for me and squirrel him away someplace? I may ask for him back later, or I may not. And if you have room for him now, and then you need to throw him away later, that's fine too." So for this kind of "put", the kernel has to accept that there is some probability that it might get the page of data back if it asks for it, and some probability that the data might completely disappear... So in the first of the two choices, the probability that the kernel might get the data back is 100%, and in the second case, the probability is less than 100%. It may be a lot less than 100%, we just don't know, because it's all very dynamic.
OK, although Tux can be very entertaining, let's take a step back for a moment and give these ideas some names. First, instead of "Type B memory", we are going to use the term "Transcendent Memory", or "tmem" for short. The word “transcendent” means "beyond the senses" and, by definition, Type B memory is beyond the senses of the kernel because, well, the kernel can't enumerate it and can't address it like it addresses normal memory; it can only move a page of data at a time. And the kernel has to overcome its ego and ask for permission.
The two types of "puts", we are going to call "persistent" and "ephemeral". The kind of "put" where we know we can definitely get Tux back, 100% of the time, we are going to call a "persistent put". And the kind of "put" where we don't care if we get Tux back, where the probability is less than 100%, we are going to call an "ephemeral put".
And when we ask Transcendent Memory for that page of data back, we are going to call that operation a "get." And if the kernel knows it isn’t going to need that page of data anymore and wants to tell tmem to throw it away, we will call that a flush. How does the kernel identify the page of data that it wants to put, get, or flush?
Well, for normal RAM there is a thing called a "physical address", which you can use to access any byte of RAM. And the processor has a large fancy virtual address space that it can use. You can't do that with Transcendent Memory.
For transcendent memory, for puts and gets, we need to provide a new kind of addressing, that we call a "handle", which is kind of an object-oriented name for a page. Within certain constraints, the OS kernel gets to decide what "handle" to use when "put"-ing the page of data and then uses that same handle when it wants to "get" it.
One example: Maybe the kernel is running as a guest, a virtual machine, and the page of data goes to special memory owned and managed by the hypervisor. This is actually where the concept of Transcendent Memory began over four years ago; the host-side has been implemented in Xen for three years and the guest-side works today in Oracle's Unbreakable Enterprise Kernel.
So virtualization is one example. Another example, we could compress Tux. Since most data compresses by about a factor of two, that could potentially save a lot of RAM. This functionality is fully working today, is called "zcache" and has been merged in the upstream Linux kernel tree for about a year and a half. With zcache, Type A memory is normal addressable kernel memory and Type B memory consists entirely of compressed pages.
Or... we could send Tux to a completely different place as long as we can get him back if and when we need to. Maybe that place is some underutilized RAM on a completely different machine.
That's a feature called RAMster, which went into Linux earlier this year. Rather than a guest and a hypervisor, we view multiple physical machines in a cluster as peers and allow them to work together to dynamically load balance their memory demand. In this cluster, if one machine is overloaded and another is basically idle, the idle machine’s RAM can be used to store pages of data for the overloaded machine. Kind of a poor man’s virtualization. This actually works pretty well over any protocol that supports kernel sockets, even a 100Mbit Ethernet connection.
Or maybe it's some solid state device that we are using not as an I/O device but as a RAM extension. As you may know, solid state devices, or SSD's, are getting very fast, almost as fast as RAM, but they have a number of idiosyncrasies that make it difficult for them to be used instead of RAM. It turns out that the rules of Transcendent Memory might be a good way to work around those idiosyncrasies. Or maybe we can combine the last two ideas...
maybe that "different place" is some solid state device on another machine, that serves as a shared RAM extension for any and all of a set of blades in a cabinet, depending on what blade at any given time is short on memory. These other ideas are in the early stages of exploration.
So this may seem like a lot of cool stuff, but doesn’t it require massive changes to the kernel? The answer, fortunately, is NO. The concept of an ephemeral put is a really good match for something the kernel does all the time, namely to “evict” clean page cache pages. A fairly simple, non-invasive patch to Linux called the cleancache patchset allows the kernel to use transcendent memory for these types of pages.
Similarly, something called “anonymous” pages represent the important data of running applications, and when the kernel is running short on memory, it starts swapping these anonymous pages out. And swap pages happen to be a really good match for transcendent memory’s “persistent” puts. We called this the frontswap patchset and that too was very clean and non-invasive.
Since cleancache and frontswap feed pages to transcendent memory, we call them “frontends”. It’s no coincidence that page cache pages and anonymous pages constitute the vast majority of pages the kernel manages in a running system, so the frontends can pass a lot of pages to tmem. It’s also no coincidence that cleancache and frontswap interface cleanly to any of zcache, RAMster, Xen, or future transcendent memory implementations, which we call tmem “backends”.