Storage in a Mission Critical Cloud(stack)

Presentation at the CloudStack Collaboration Conference 2012, outlining the evolution of the Schuberg Philis cloud storage design and how we got to version 1.0: the pitfalls, the painful moments and how we overcame them. It should be accompanied by the video that was shot; I'll add that as soon as it's online.

Published in: Technology

  • On advice from a friend!
  • Go away or I will replace you, be it by a large blob or a small shell script.
  • We had a very traditional way of thinking about architectures and environments, extremely risk averse. The same goes for software, even though some of our engineers have extremely good insight into the application landscape and would have liked to do things a bit differently.
  • A classical example of what our infrastructures look like: these are customers that want strict separation between themselves and everyone else, not just software-wise but also hardware-wise. The enterprise side of the stick. Four years ago I had lengthy discussions with one of our customers about virtualizing their core systems; the opinion at that point in time was that virtualization technology was not mature enough yet to be used, let alone “cloud”. Now this customer is one of the front-runners when it comes to their willingness to move into a cloud-like infrastructure with us.
  • The problem with storage as we used to do it, and as our customers demanded we do it, is that a lot of potential is wasted by the technology choices. Engineers going hybrid want to go back to block based for the feeling of “safety”; most don’t know what the real dangers are. Part of the infrastructure mentioned here has been “replaced” by iSCSI infrastructure, which required a serious number of devices too. Although these devices are a lot more power friendly, there are also a lot more of them as the customer wants to push more IO… the big question is of course the type of IO, but that’s a different discussion.
  • The world around us was changing. Industrialization of IT as we know it has been going on forever: the “I will replace you with a very small shell script” approach.
  • Besides this objective there was also the objective to make stuff greener and more measurable, and to give ourselves even more insight into our environment and the effects of changes in that environment.
  • In the meantime we were getting serious traction with where we wanted to head, and we had a get-together with all the stakeholders. We should up the stakes a bit, so with regard to bold goals… we decided to give some of the mess we had lying around in racks a decent replacement home. This is a row in our datacenter, two half rows actually, of 5 cabinets on each side, consisting of employee hardware: old servers, desktops, laptops, an ancient NetApp, even tape drives. Looking at the age of the hardware, some of the servers there will use at least as much power as two to three modern servers (my NEC dual Xeon was a nice example with its 10kRPM uw320 SCSI RAID setup). This means that all our private data was going to land on the cloud, and trust me, if there’s one thing engineers really dislike it is losing their data…
  • Cloud workloads can be separated into two domains for now, until the two domains have fully merged, which they will over time, but they’re just not there yet. There are combinations that can be made to largely enable this, but these will not give the guarantees that most enterprises would want to have. Elastic Block Storage vs more traditional ways of doing storage.
  • Let’s be honest, your cloud starts and stops with storage. Well okay, some networking and compute too, but storage is the cornerstone of your infrastructure, be it fully distributed over local/semi-local leveraged storage or a monolithic box sitting in that single room behind you. FC is like the living dead: expensive, dead for a long time, but it keeps coming alive as people want to do ridiculous stuff and put an enterprise badge on it (FCoE). iSCSI: you run a FS on top of a LUN, there are in-flight data issues when stuff goes bad, and people want to run FSCK all the time after something goes wrong… NFS, also called NoFS by some, requires knowledge and experimentation depending on workload to get out of the NoFS zone and to highly performant. NFS 4.1 etc. AoE, ATA over Ethernet: hard to get onto your hypervisor, the light weight is very cool, not routable though… InfiniBand backend based storage: a cheap alternative to 10G, works well, but does not stretch very well. EBS/object based has not reached the maturity level we would like to see, yet… (getting there fast though). The choice is between data delivery to your doorstep and what to use behind it: which protocols and underlying technology to choose from. Reliability is important, but our customers focus mostly on data integrity, as do we.
The difference between reliability and integrity lies in the fact that people rely more on the accuracy of the data than on the data being available. Some customers would prefer their data not to be there, and to have it restored, over having invalid data returned. The whole point is that you don’t want data loss at the lower level, so ideally you want integrity checking on every level, so that self healing crosses the entire stack. Looking at the flexibility and roadmaps of potential vendors and the technology behind their products, we also wanted to be able to switch carrier protocol if we had or wanted to, or if it turned out we had made a major booboo; we’ll get back to that later… We also wanted flexibility and something that gave us confidence that it was actively developed, yet mature in every aspect. In the meantime we had to get something up and running to do our pre-beta cloud testing on after we got out of the initial selections of the test phase. We were by no means, as a company, ready yet to accept a fully fledged open source product as our backend, and even if we were, would our customers be? A lot has changed and is changing as we’re evolving ourselves, so we’ll have to see where the future takes us. Large single name (Isilon), unified file systems (Gluster), object storage (Gluster), build it yourself: BSD, Linux or Solaris? What to do about hardware support, is it the market we want to be in? Full open source or half and half?
  • FrankenPods is a bunch of hardware that was left over, scavenged and plugged together, maximizing the memory and CPU grunt we could find.
  • To get primary and secondary storage up as quickly as possible we could go with iSCSI directly from the kit we had lying around, but some of us are not huge fans of iSCSI as you still have to run a file system on top, and some of us have seen some really nasty stuff happen there with regular file systems. That in a way also rules out FC, not that I would ever want FC anyway. Packet overhead is comparable. Looking at XenServer 5.6, which we were running, the natural choice became NFS. That fitted great with this company we heard about called Nexenta, through the buddy of a friend/colleague of mine who’s CTO at Gandi, who run quite a bit of it for their cloud. The beauty of it was that if we ever changed our minds we could go back to doing other stuff (haha as if.., more on that later). So we downloaded their stuff and installed it. We didn’t want to reconfigure everything, so we left most stuff as it was with the RAID setups in the boxes; in hindsight we would have done better to choose pass-through from the EqualLogic and on the local RAID controller. We were doing good ole NFS and getting quite decent performance/responsiveness, at least if VMs were already deployed; the deployment could take a couple of minutes depending on which image was chosen. But we were satisfied, as we didn’t really have any headaches with it and could just let it run, and it would. Mind you that the configuration we did was not possible from the “clicky-clicky” GUI, so we hacked that from the shell.
  • Self healing, object based transaction groups, no inconsistent state, unlimited snapshots if you want them. Tunable for application and usage in every way, also with regard to the underlying storage fabric, which brings us to the following slide.
  • How ZFS is layered. SAS/FC/iSCSI, maybe even others in the future…… (AoE for example), HDFS and Ceph ?
  • While preparing for our final production cloud we first had to build our Beta environment, and the question came up whether we had to have sync-mirrored storage. Some people said yes, some people said no. Think of it as The Monkey, Banana and Water Spray Experiment; for the people that don’t know it, I’ll half-bake summarize it: this experiment involved 5 monkeys (10 altogether, including replacements), a cage, a banana, a ladder and an ice cold water hose. ….. The motivation of the “nay” sayers was that these problems belong in another layer and should be solved there, if you look at it from a pure software development/system approach: solve your replication in the application or service application layer and not in a semi-hardware layer. If one looks at it from the “yea” sayer point of view, however, the answers would mostly be along the lines of “Oracle RAC stinks”, “it makes everything a lot easier” or “there are applications where it just isn’t possible”. It takes people a while to see each other’s points, and eventually the decision was made that some form of sync-mirrored data should be available: if there were NO other viable way of solving the problem of data replication in an environment, we would use that. People tend to forget that data replication means your errors are also replicated, as data replication usually takes place on block or object level, but that is an entirely different discussion again. And replication significantly impacts storage performance, even with NVRAM backed solutions, or solutions that require sync writes to pass straight through; think of your CLARiiON/VNX metro cluster losing one of the first five disks.
  • Our initial storage design for primary storage looked like this, using metro clusters on two or three sites and local HA clusters on every site. In this example the red tenant is running active-active on its own Nicira SDN network and has storage in two datacenters; the storage is not sync-mirrored, so the applications need to take care of it. The green tenant is running metro clustered; his data will be in both locations at all times. Mind you that this is not an entirely accurate display of OVS and Nicira usage. Another thing we decided, as CloudStack does not require you to have a single namespace when adding storage, was that we would build our storage in such a way that we could just add more if we wanted to. PODs are seen as a logical unit to also scale storage in, but because we had no idea what our real life IO profile would be like, we wanted to keep the boxes modular to enable us to adapt as we grow.
  • The entire stack we selected eventually looks like this..
  • Hardware choices, focus on storage. Migrating the data from pre-beta was a pain, as the EqualLogics under pre-beta were so slow. Most data moved, not all. We’ve had a ton of HA failovers due to one of the heads having been produced on a Monday morning, it seemed. Power supplies replaced, power distributor replaced, motherboard replaced, eventually the chassis replaced. The idea was that it would be built for failure, as without the storage backend you’re nowhere, and this proved to us that the setup we chose worked really, really well.
  • Why would you want triple mirrors???? And three JBODs and and and dual heads… Initially we started with dual mirrors, and thought about raidz2, which mimics RAID-DP in a lot of ways but really isn’t the same, as RAID-DP is a bit more efficient than raidz2, we thought at that time.
  • What did we learn in Beta, and what do we think is wise for people to understand and see? Ranging from communication problems about setups and configurations of networking, storage and hypervisors, we landed on real issues with our network cards, hypervisor boxes and also our storage kit. There were intermittent problems with throughput on networking, which thus hit storage too. Initially we found out that the Emulex network cards combined with XenServer were giving us grief; when testing we got mixed results from different Linux distributions with these cards, as they all had different driver versions. So we decided to move to Broadcom, which Nicira also had good experience with. In our storage boxes we kept Emulex, as they are known to perform excellently with Solaris and derivatives.
  • Default = off in the systemVM image. Spontaneous reboots of the hypervisors were killing System VMs, but could also cause corruption for all PV hosts running ext. Painful and long recovery times. RouterVMs starting and hanging. FSCK would totally wreck the FS. Missing INIT. Sometimes destroying the router. A kludge is Nagios; the real solution is in the pipeline: an updated systemvm.iso, and an updated systemvm image running a different release.
  • Client side measurement of block sizes… rsize, wsize… The default is 1MB packets, which is awesome if you’re doing HPC stuff, not really if you’re trying to run VMs. Dynamic window scaling was designed for internet traffic and WAN links, not for the kind of stuff we’re doing with 10Gb on a link or 10Gb low latency interconnects. The problem however lies in the way packets are used in dynamic window scaling. So we’d see high volume traffic from a host going at 300MB/s, and after a while all traffic would only be 600KB/s from a bunch of hosts, and at a later stage they would perform well again.
  • Block sizes on storage were set way too big. We started with 128k on the FS, but in reality it should have been a lot smaller; looking at the block writes we’re doing now it should be way, way, way smaller. We initially set it to 32k, but even that is too big. A nice thing is that you can dynamically adjust the block sizes, although this will only apply to newly written data, and not to old data sitting there. For secondary storage we left the block size big, as we’re doing big writes there.
  • In our case it turned out that disks that were slower than other disks were holding up the other disks. Now on its own that’s not such a bad thing. But you do want the SPA sync to throttle your incoming IO, unless it gets really crowded. iostat -en and spasync.d are your friends if things turn pear-shaped; the variable to tweak is …. When doing this, another corresponding setting needs to be upped: the TXG group timeout. Keep in mind that these variables correspond with how fast disks can write data. SAS 15K RPM 3.5” drives do 500ms, SAS 7.2K RPM take around 1000 to 1500. iostat -en, hdparm -I /dev/disk
  • Secondary storage is zone-wide, so we are moving to full layer 3 instead of mixed layer 3/2.
  • Talk about the numbers
  • Also
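The notes above lean heavily on integrity over availability: rather return nothing than return invalid data, and let self healing repair what it can. A minimal Python sketch of that idea follows. It is a toy model, not how ZFS is implemented: the function names, the use of SHA-256 and the three in-memory "mirror sides" are all assumptions made for illustration.

```python
import hashlib
from typing import List


def checksum(data: bytes) -> str:
    """Checksum kept separately from the data it protects (illustrative choice: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies: List[bytearray], expected: str) -> bytes:
    """Return the first copy whose checksum matches and repair the copies that do not.

    Raises IOError when no copy verifies, so that invalid data is never returned.
    """
    good = None
    for copy in copies:
        if checksum(bytes(copy)) == expected:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("all copies fail the checksum; refusing to return invalid data")
    for copy in copies:
        if bytes(copy) != good:
            copy[:] = good  # overwrite the damaged copy with the verified one
    return good


# Hypothetical triple mirror holding one block.
block = b"customer ledger block"
expected = checksum(block)
mirror = [bytearray(block), bytearray(block), bytearray(block)]

mirror[0][0] ^= 0xFF  # simulate silent corruption on one side of the mirror
data = self_healing_read(mirror, expected)
```

After the read, `data` holds the verified block and the corrupted side has been rewritten from a good copy. With every side corrupted the call raises instead of handing back bad data, which is the behaviour some customers in the notes prefer.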

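The block-size lesson in the notes (128k was too big for VM traffic, and even 32k was too big) can be made concrete with some arithmetic. The sketch below assumes that the "block size" in the notes corresponds to a per-record size on the file system, and that a partial write to a record causes the whole record to be rewritten. It ignores caching, compression and write aggregation, and the 4 KiB guest write size is an assumed example, not a figure from the talk.

```python
def records_touched(offset: int, length: int, recordsize: int) -> int:
    """Number of fixed-size records that a write of `length` bytes at `offset` overlaps."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return last - first + 1


def write_amplification(offset: int, length: int, recordsize: int) -> float:
    """Bytes rewritten on disk divided by bytes the guest actually wrote (simplified model)."""
    return records_touched(offset, length, recordsize) * recordsize / length


KIB = 1024
guest_write = 4 * KIB  # assumed small random write issued by a VM

for recordsize in (128 * KIB, 32 * KIB, 4 * KIB):
    amp = write_amplification(0, guest_write, recordsize)
    print(f"recordsize {recordsize // KIB:>3}k -> amplification {amp:.0f}x")
```

Under these assumptions a small aligned write costs 32x at 128k and 8x at 32k, which fits the observation that 32k was still too big for VM workloads, while large sequential writes to secondary storage are barely affected.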
    3. Cloud Workloads. Traditional workload: reliable hardware, backup and restore bits for users when failure happens. Distributed workload: tell users to expect failure; users to build apps that can withstand infrastructure failure. Both types of workloads must run reliably in the cloud.
    7. ZFS stack, top to bottom: raw, swap, dump, iSCSI, ?? / ZFS, NFS, CIFS, ??; ZFS Volume Emulator (Zvol), ZFS POSIX Layer (ZPL), pNFS, Lustre, ??; Transactional Object Layer; Pooled Storage Layer; Block Device Driver; HDD, SSD, FC, ??