ScaleIO is software-defined, distributed shared storage. It is a software-only solution that enables you to create a SAN from direct-attached storage (DAS) located in your servers. ScaleIO creates a large pool of storage that can be shared among all servers. This storage pool can be tiered to supply differing performance needs. ScaleIO is infrastructure-agnostic. It can run on any server, whether physical or virtual, and leverage any storage media, including disk drives, flash drives, or PCIe flash cards.
ScaleIO is all about convergence, scalability, elasticity, and performance. The software converges storage and compute resources into a single architectural layer, which resides on the application server. The architecture allows for scaling out from as few as three servers to thousands by simply adding nodes to the environment. This is done elastically; increasing and decreasing capacity and compute resources can happen “on the fly” without impact to users or applications. ScaleIO also has self-healing capabilities, which enable it to recover easily from server or disk failures. ScaleIO aggregates all the IOPS in the various servers into one high-performing virtual SAN. All servers participate in servicing I/O requests using massively parallel processing.
ScaleIO converges storage and compute resources into a single-layer architecture, aggregating capacity and performance and simplifying management. All I/O and throughput are collective and accessible to any application within the cluster. With ScaleIO, storage is just another application running alongside other applications, and each server is a building block for the global storage and compute cluster. Note to Presenter: Click now in Slide Show mode for animation. Converging storage and compute simplifies the architecture and reduces cost without compromising on any benefit of external storage. ScaleIO enables the IT administrator to singlehandedly manage the entire data center stack, improving operational effectiveness and lowering operational costs.
ScaleIO is designed to massively scale from three to thousands of nodes. Unlike most traditional storage systems, as the number of servers grows, so do throughput and IOPS. The scalability of performance is linear with regard to the growth of the deployment. Whenever the need arises, additional storage and compute resources (i.e., additional servers and drives) can be added modularly. Storage and compute resources grow together so the balance between them is maintained. Storage growth is therefore always automatically aligned with application needs.
With ScaleIO, you can increase or decrease capacity and compute whenever the need arises. You need not go through any complex reconfiguration or adjustments due to interoperability constraints. The system automatically rebalances the data “on the fly” with no downtime. No capacity planning is required, which is a major factor in reducing complexity and cost. You can think about it as being tolerant toward errors in planning. Insufficient storage? Starvation? Just add nodes. Overprovisioned? Just remove them. Additions and removals can be done in small or large increments, contributing to the flexibility in managing capacity. ScaleIO will run with just about any commodity hardware—any server or operating system, bare metal or virtualized, with any storage media (HDDs, SSDs, or PCIe cards). A ScaleIO environment can comprise any mix of the above. This is true elasticity.
Every server in the ScaleIO cluster is used in the processing of I/O operations. The architecture is parallel so, unlike a dual controller architecture, there are no bottlenecks or “choke points.” As the number of servers increases, more compute resources are added and utilized. Performance scales linearly and cost/performance rates improve with growth. Performance optimization is automatic; whenever rebuilds and rebalances are needed, they occur in the background with minimal or no impact to running applications. For performance management, manual tiering can be designed by using storage pools. For example, one can create a designated performance tier to utilize low-latency, high-bandwidth flash media and a designated capacity tier to utilize disk spindles of various kinds.
Let’s see how IOPS and capacity are aggregated with ScaleIO. In a traditional environment, if you have 10 servers, each with 1 TB of capacity and 100 IOPS of processing power, then an application running on one of those servers can use only 100 IOPS and 1 TB of storage. Note to Presenter: Click now in Slide Show mode for animation. With ScaleIO, this storage and compute power is aggregated and becomes a total of 1,000 IOPS and 10 TB for the system. The same application can now have access to 10 times more capacity and IOPS. Note to Presenter: Click now in Slide Show mode for animation. If you add another 10 servers to the cluster, the system now offers 2,000 IOPS and 20 TB. The system scales linearly. Doubling the size doubles the capacity and IOPS. All you have to do is bring the boxes and connect them. ScaleIO will do the rest.
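To make the arithmetic concrete, here is a small, hypothetical Python sketch of the linear aggregation described above; the per-server figures are the illustrative numbers from this example, not measured ScaleIO results.

```python
# Illustrative arithmetic only: aggregate capacity and IOPS scale linearly
# with the number of servers in the cluster (figures from the example above).

def aggregate(servers, iops_per_server=100, tb_per_server=1):
    """Return (total_iops, total_tb) for a cluster of identical servers."""
    return servers * iops_per_server, servers * tb_per_server

for n in (10, 20):
    iops, tb = aggregate(n)
    print(f"{n} servers -> {iops} IOPS, {tb} TB available to any application")
# 10 servers -> 1000 IOPS, 10 TB available to any application
# 20 servers -> 2000 IOPS, 20 TB available to any application
```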
ScaleIO makes much of the traditional storage infrastructure unnecessary. You can create a large-scale SAN without arrays, dedicated fabric, or HBAs. With ScaleIO, you can leverage the local storage in your existing servers that often goes unused, ensuring that IT resources aren’t wasted. And you simply add servers to the environment as needed. This gives you great flexibility in deploying various size SANs and modifying them as needed. It also significantly reduces the cost of initial deployment.
As mentioned before, the system rebalances data automatically when resources are added or removed. Let’s see how it works. Note to Presenter: Click now in Slide Show mode for animation. The system virtually manages and reconfigures itself as the underlying resources change. As you add capacity, ScaleIO rearranges the data on the servers to optimize performance and enhance resilience. All of this happens automatically without operator intervention and with minimal impact to applications and users. At the end of a rebalance operation, the system is fully optimized for both performance and data protection. No explicit reconfiguration is needed.
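The following toy Python sketch illustrates the rebalancing idea only (it is not ScaleIO's actual placement logic): when a server is added, chunks migrate from the most loaded servers to the new one until the distribution is even again.

```python
# Toy rebalance: move chunks from the most loaded servers to the least loaded
# until the per-server chunk counts differ by at most one.
from collections import Counter

def rebalance(placement, servers):
    """placement: dict chunk -> server. Returns an evenly spread placement."""
    load = Counter({s: 0 for s in servers})
    for srv in placement.values():
        load[srv] += 1
    while max(load.values()) - min(load.values()) > 1:
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        chunk = next(c for c, s in placement.items() if s == src)
        placement[chunk] = dst          # copy the chunk to dst, then free it on src
        load[src] -= 1
        load[dst] += 1
    return placement

placement = {f"chunk{i}": f"sds{i % 4 + 1}" for i in range(40)}     # 4 servers, 10 chunks each
placement = rebalance(placement, [f"sds{i}" for i in range(1, 6)])  # add a fifth server
print(Counter(placement.values()))      # roughly 8 chunks per server after the rebalance
```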
A similar process happens when nodes are removed. In this illustration, three servers are removed from an eight-server cluster. After the rebalance, the data is rearranged on the remaining five servers, spread (striped) evenly, and remains fully redundant.
Managing a ScaleIO deployment is easy. Everything from installation and configuration to monitoring and upgrades is simple and straightforward. Anyone who manages the data center can fully administer the deployment, without specialized training or vendor certification. The complexity of storage administration is completely eliminated. The screens shown here are all that is needed to monitor the ScaleIO system. There is also a simple CLI for configuration and various system actions. Because the system manages itself and takes all the necessary remedial actions when a failure occurs, including re-optimization, there is no need for operator intervention. However, ScaleIO features a “call home” capability via email, which alerts the administrator should an event occur. The admin can then respond to the event (if necessary) even outside of business hours. The administrator can also follow system operations and monitor their progress. For example, rebuild and rebalance operations can be monitored via the dashboard while they execute.
In large data centers, many applications are deployed, many different requirements exist, and operational parameters are dynamic, changing frequently and without much notice. Whether you are a service provider delivering hosted infrastructure as a service or your IT department delivers infrastructure as a service to functional units within your organization, ScaleIO offers a set of features that gives you complete control over performance, capacity and data location. Protection domains allow you to isolate specific servers and data sets. This can be done at the granularity of a single customer so that each customer can be under a different SLA. Storage pools can be used for further data segregation and tiering. For example, data that is accessed very frequently can be stored in a flash-only storage pool for the lowest latency, while less frequently accessed data can be stored in a low-cost, high-capacity pool of spinning disks. With ScaleIO, you can limit the amount of performance—IOPS or bandwidth—that selected customers can consume. The limiter allows for resource distribution to be imposed and regulated to prevent application “hogging” scenarios. Light data encryption at rest can be used to provide added security for sensitive customer data. Finally, ScaleIO offers instantaneous, thinly provisioned writable snapshots for data backups. For both enterprises and service providers, these features enhance system control and manageability—ensuring that quality of service is met.
ScaleIO offers obvious cost benefits and has been proven to deliver greater than 60% TCO savings. First, the software-only system uses commodity hardware. And because it creates a server-based SAN, there are no dedicated storage components like fabric and HBAs. This allows for reduced power, cooling, and space, which yields tremendous cost savings. And because there is no large storage system with ScaleIO, there are no “forklift” upgrades for end-of-life hardware. You simply remove failed disks or outdated servers from the cluster. As mentioned before, the complexity of storage administration is completely eliminated with ScaleIO, so you can drastically reduce your administrative overhead. Finally, the software is licensed per TB. You simply pay as you grow with no surprise costs. You never have to buy more storage than you need.
The primary use cases for ScaleIO are virtual server infrastructure (VSI), virtual desktop infrastructure (VDI), databases, and development & testing. Let’s take a closer look at each of these in detail.
Generally, VSI environments require large amounts of storage that can be grown easily. At the same time, they require easy manageability and a low dollar per server cost. ScaleIO is ideal for VSI because it leverages any commodity hardware and can accommodate any size growth. No capacity planning is required, as growth in both capacity and performance is easy and linear. ScaleIO is easy to manage, requiring little administration. With no need for dedicated storage components or expensive arrays, TCO and dollar per server are low.
Generally, VDI environments require high levels of performance at peak times, such as boot storms. They require large amounts of storage that can be grown easily as the number of users increases. At the same time, they require a low dollar per desktop cost. ScaleIO is ideal for VDI because every server in the cluster is used in the processing of I/O operations, eliminating bottlenecks and stabilizing performance. It leverages any commodity hardware and can accommodate any size growth. No capacity planning is required, as growth in both capacity and performance is easy and linear. ScaleIO is easy to manage, requiring little administration. With no need for dedicated storage components or expensive arrays, TCO and dollar per desktop are low.
Generally, database environments require high write performance, high availability, quick recovery, and low cost of storage. ScaleIO is ideal for databases because converged storage and compute allows for very fast writes. Its massive parallelism delivers quick recovery and stable, predictable performance. With no need for dedicated storage components or expensive arrays, TCO is kept low.
Development and testing environments do not require a ton of capacity, nor do they need top-tier performance. But they must be low cost, since there is no revenue directly tied to them. Dev/test environments often change rapidly for repurposing. ScaleIO is ideal for development and testing environments. Its auto-rebalancing, easy scale-out, and elasticity with no downtime are a perfect fit for dynamic environments. Its low initial cost is justifiable for a non-production workload and allows for more investment in what matters—compute.
Now let’s talk about how ScaleIO works by looking at the various components of the architecture and describing their roles.
The first component is the ScaleIO Data Client, or SDC. The SDC is a block device driver that exposes ScaleIO shared block volumes to applications. The SDC runs locally on any application server that requires access to the block storage volumes. The blocks that the SDC exposes can be blocks from anywhere within the ScaleIO global virtual SAN. This enables the local application to issue an I/O request, and the SDC fulfills it regardless of where the particular blocks reside. The SDC communicates with other nodes (beyond its own local server) over a TCP/IP-based protocol, so it is fully routable. TCP/IP is ubiquitous and is supported on any network; data center LANs are naturally supported. Note to Presenter: Click now in Slide Show mode for animation. You can see the I/O flow in this animation. The application issues an I/O, which flows through the file system and volume manager, but instead of accessing the local storage on the server (via the block device driver), it is passed to the SDC (denoted as ‘C’ in the slide). The SDC knows where the relevant block resides in the larger system and directs the request to its destination (either locally or on another server within the ScaleIO cluster). The SDC is the only ScaleIO component that applications “see” in the data path. Note that in a bare-metal configuration, the SDC is always implemented as an OS (kernel) component. In virtualized environments, it is typically implemented as a hypervisor element or as an independent VM.
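As a rough illustration of this client-side routing, the sketch below (with a hypothetical chunk size and mapping table, not ScaleIO internals) shows how a driver can compute which chunk an offset falls in and look up the owning server, so every I/O goes directly to the right node.

```python
# Minimal sketch of the SDC routing idea: map a byte offset to a chunk index,
# then to the server that owns that chunk. Chunk size and table are hypothetical.

CHUNK_SIZE = 1 << 20  # 1 MB chunks, matching the layout discussion later

# volume -> list of owning servers, indexed by chunk number (toy mapping)
chunk_map = {"vol1": ["sds3", "sds7", "sds1", "sds9"]}

def route(volume, byte_offset):
    """Return the server that should service an I/O at this offset."""
    chunk_index = byte_offset // CHUNK_SIZE
    return chunk_map[volume][chunk_index]

print(route("vol1", 0))                       # sds3
print(route("vol1", 3 * CHUNK_SIZE + 4096))   # sds9
```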
The next component in the ScaleIO data path is known as the ScaleIO Data Server, or SDS. The SDS owns local storage that contributes to the ScaleIO storage pools. An instance of the SDS runs on every server that contributes some or all of its local storage space (HDDs, SSDs, or PCIe flash cards) to the aggregated pool of storage within the ScaleIO virtual SAN. Local storage may be disks, disk partitions, or even files. The role of the SDS is to actually perform I/O operations as requested by an SDC on the local server or on another server within the cluster. Note to Presenter: Click now in Slide Show mode for animation. You can see the I/O flow in this animation. A request, originated at one of the cluster’s SDCs, arrives over the ScaleIO protocol at the SDS. The SDS uses the native local media’s block device driver to fulfill the request and returns the results. An SDS always “talks” to the local storage, the DAS, on the server it runs on. Note that an SDS can run on the same server that runs an SDC, or the two can be decoupled; the components are independent of each other.
ScaleIO’s control component is known as the metadata manager, or MDM. The MDM serves as the monitoring and configuration agent. The MDM holds the cluster-wide mapping information and is responsible for decisions regarding migration, rebuilds, and all system-related functions. The ScaleIO monitoring dashboard communicates with the MDM to retrieve system information for display. The MDM is not on the ScaleIO data path. That is, reads and writes never traverse the MDM. The MDM may communicate with other ScaleIO components within the cluster in order to perform system maintenance and management operations but never to perform data operations. This means that the MDM does not represent a bottleneck for data operations and never limits the scaling of the overall cluster. The MDM consumes only resources that are not needed by applications or data-path activities. The MDM does not preempt users’ operations and does not impact overall cluster performance and bandwidth. To support high availability, three instances of the MDM can be run on different servers. This is also known as the MDM cluster. An MDM may run on servers that also run SDCs and/or SDSs, or on a separate server. During installation, the user decides where the MDM instances reside. If the primary MDM fails (due to a host crash, for example), another MDM takes over and functions as primary until the original MDM is recovered. The third instance is usually used both for HA and as a tie-breaker in case of conflicts.
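The quorum idea behind the three-instance MDM cluster can be sketched as follows; this is a toy majority check, not ScaleIO's actual election protocol.

```python
# Toy illustration of majority-based failover: a standby becomes primary only
# when a majority of the three instances (e.g. standby plus tie-breaker) is
# reachable, which avoids split-brain when the cluster is partitioned.

def elect_primary(alive):
    """alive: set of reachable roles, e.g. {"primary", "secondary", "tiebreaker"}."""
    if len(alive) < 2:
        return None                      # no majority -> no primary is declared
    return "primary" if "primary" in alive else "secondary"

print(elect_primary({"primary", "secondary", "tiebreaker"}))  # primary stays primary
print(elect_primary({"secondary", "tiebreaker"}))             # secondary takes over
print(elect_primary({"secondary"}))                           # None: no quorum
```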
Non-VMware virtualized environments, including Citrix XenServer, Linux KVM, and Microsoft Hyper-V, are treated identically to physical environments. Both the SDS and the SDC sit inside the hypervisor. Nothing is installed at the guest layer. Since ScaleIO is installed in the hypervisor, you are not dependent on the guest operating system, so there is only one build to maintain and test. And installation is easy, as there is only one location to install ScaleIO.
In VMware environments, ScaleIO uses a model that is similar to a virtual storage appliance (VSA), which is called ScaleIO VM, or SVM. This is a dedicated VM in each ESX host that contains both the SDS and the SDC. The VMs in that host can access the storage as depicted—to the hypervisor, then to the SVM, and from the SVM to the local storage. All the SVMs are connected, so this allows any VM in any ESX host to access any SDS in the system, as in a physical environment.
Note to Presenter: Click now in Slide Show mode for animation. A read operation originates with a single SDC and involves interaction with a single SDS—a single I/O operation. Note to Presenter: Click now in Slide Show mode for animation. As a result of a read request made by an application that runs on the same server as the SDC, the SDC determines the destination SDS for this request and sends the request to that SDS. The SDS executes the request locally and returns the result to the SDC over the LAN.
Note to Presenter: Click now in Slide Show mode for animation. A write operation originates with a single SDC and involves interaction with two SDSs—two I/O operations. Note to Presenter: Click now in Slide Show mode for animation. As a result of a write request made by an application that runs on the same server as the SDC, the SDC selects two destination SDSs in order to maintain two-copy mirrors. Two messages are sent to the two destination SDSs. The recipient SDSs each complete their operation locally and return results to the SDC. The SDC waits until both operations complete before returning a result to the requesting application. So, for example, if the application wrote 4 KB of data, two 4 KB blocks will pass over the network and be written to the media.
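A minimal sketch of this two-copy write path, using in-memory dictionaries as stand-ins for the servers' local media (the mirror locations here are hypothetical):

```python
# The client sends the block to both mirror locations and acknowledges the
# application only after both stores complete; a read touches a single copy.

sds_stores = {"sds2": {}, "sds4": {}}           # hypothetical mirror pair for this chunk

def mirrored_write(volume, block, data, mirrors=("sds2", "sds4")):
    acks = 0
    for sds in mirrors:                         # two messages, one per copy
        sds_stores[sds][(volume, block)] = data
        acks += 1
    return acks == len(mirrors)                 # ack to the app only when both are done

def read(volume, block, mirror="sds2"):         # a read involves a single SDS
    return sds_stores[mirror][(volume, block)]

assert mirrored_write("vol1", 42, b"x" * 4096)  # 4 KB written twice over the network
assert read("vol1", 42) == b"x" * 4096
```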
Let’s take a look at configuration (or deployment) options. It is common practice to install both an SDC and an SDS on the same server. This way applications and storage share the same compute resources. This slide shows such a fully converged configuration, where every server runs both an SDC and an SDS. All servers can have applications running on them, performing I/O operations via the local SDC. All servers contribute some or all of their local storage to the ScaleIO system via their local SDS. Components communicate over the LAN.
In some situations, an SDS can be separated from an SDC and installed on a different server. ScaleIO does not have any requirements in regard to deploying SDCs and SDSs on the same or different servers. Whatever the preference of the administrator is, ScaleIO works with it transparently and smoothly. Shown here is a two-layer configuration. A group of servers is running SDCs and another distinct group is running SDSs. The applications that run on the first group of servers make I/O requests to their local SDC. The second group, running SDSs, contributes the servers’ local storage to the virtual SAN. The first and second groups communicate over the LAN. In a way, this deployment is similar to a traditional external storage system. Applications run in one layer, while storage is in another layer.
At any moment, a whole new group of servers can be added as SDS servers to extend the capacity of the system as a whole. ScaleIO automatically rearranges the data, optimizing and rebalancing the data in the background without any downtime. This deployment can easily grow to thousands of nodes.
This slide shows the power of the distributed architecture of ScaleIO. Every SDC knows how to direct an I/O operation to the destination SDS. There is no flooding or broadcasting. This is extremely efficient parallelism that eliminates single points of failure. Since there is no central point of routing, all of this happens in a distributed manner. Each SDC does its own routing, independent from any other SDC. The SDC has all the intelligence needed to route every request, preventing unnecessary network traffic and redundant SDS resource usage. This is, in effect, a multi-controller architecture that is highly optimized and massively parallel. It allows performance to scale linearly as the number of nodes increases. ScaleIO is capable of handling asymmetric clusters with different capacities and media types.
Similarly, a fully converged configuration will have even higher parallelism and load distribution between the nodes. Any combination of the fully converged and two-layer configuration options is valid and operational.
Now let’s look at the distributed data layout scheme of ScaleIO volumes. This scheme is designed to maximize protection and optimize performance. On the left, you see data Volume 1 in grey and data Volume 2 in blue. On the right, you see a 100-node SDS cluster. A single volume is divided into chunks of reasonably small size, say 1 MB. These chunks are scattered (striped) across physical disks throughout the cluster in a balanced and random manner. Note to Presenter: Click now in Slide Show mode for animation. Once the volume is provisioned, the chunks of Volume 1 are spread throughout the cluster randomly and evenly. Note to Presenter: Click now in Slide Show mode for animation. Volume 2 is treated similarly. Note that the slide shows a partial layout; ideally, the chunks are spread over all 100 servers. It is important to understand that ScaleIO volume chunks are not the same as data blocks. I/O operations are done at the block level. If an application writes 4 KB of data, only 4 KB is written, not an entire chunk. The same goes for read operations—only the required data is read.
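Here is a small Python sketch of the layout idea, purely illustrative: carve a volume into 1 MB chunks and scatter them randomly across the cluster so capacity and load spread evenly. The real placement also accounts for devices and balance, which this sketch ignores.

```python
# Toy chunk layout: each chunk of the volume is assigned to a randomly chosen
# server, approximating an even, random spread across the whole cluster.
import random

CHUNK_SIZE_MB = 1

def lay_out(volume_size_mb, servers, rng=random.Random(0)):
    """Return a list: chunk index -> server holding that chunk."""
    return [rng.choice(servers) for _ in range(volume_size_mb // CHUNK_SIZE_MB)]

servers = [f"sds{i}" for i in range(1, 101)]   # the 100-node cluster on the slide
layout = lay_out(1024, servers)                # a 1 GB volume -> 1024 chunks
print(layout[:5])                              # five server names; spread is roughly even overall
```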
Now let me explain the ScaleIO two-copy mesh mirroring scheme. For simplicity, I will illustrate it with Volume 2, which has only five chunks: A, B, C, D, and E. The chunks are initially stored on servers as shown. In order to protect the volume data, we need to create redundant copies of those chunks. We end up with two copies of each chunk. It is important that we never store copies of the same chunk on the same physical server. Note to Presenter: Click now in Slide Show mode for animation. The copies have been made. Now, chunk A resides on two servers: SDS2 and SDS4. Similarly, all of the other chunks’ copies are created and stored on servers different from their first copy. Note that no server holds a complete mirror of another server. The ScaleIO mirroring scheme is referred to as mesh mirroring, meaning the volume is mirrored at the chunk level and is “meshed” throughout the cluster. This is one of the factors in enhancing overall data protection and cluster resilience. A volume never fails in full, and rebuilding a particular damaged chunk (or chunks) is fast and efficient, as it is done simultaneously by multiple servers. When a server fails (or is removed from the cluster), its chunks are spread over the whole cluster and rebuilding is shared among all the servers.
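The mesh-mirroring placement rule can be sketched in a few lines; this illustrates only the constraint (two copies, never on the same server), not the real placement algorithm.

```python
# Toy mesh mirroring: every chunk gets two copies, and the second copy is
# always placed on a different server than the first.
import random

def mirror_layout(chunks, servers, rng=random.Random(1)):
    placement = {}
    for chunk in chunks:
        first = rng.choice(servers)
        second = rng.choice([s for s in servers if s != first])  # never the same server
        placement[chunk] = (first, second)
    return placement

layout = mirror_layout(["A", "B", "C", "D", "E"], [f"sds{i}" for i in range(1, 7)])
for chunk, (primary, mirror) in layout.items():
    assert primary != mirror
    print(chunk, "->", primary, mirror)
```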
Let’s take a look at a server failure scenario. SDS1 presently stores chunks E and B from Volume 2 and chunk F from Volume 1. Note to Presenter: Click now in Slide Show mode for animation. If SDS1 crashes, ScaleIO needs to rebuild these chunks, so chunks E, B, and F are copied to other servers. This is done by copying the mirrors. The mirrored chunk of E is copied from SDS3 to SDS4, the mirrored chunk of B is copied from SDS6 to SDS100, and the mirrored chunk of F is copied from SDS2 to SDS5. This process is called forward rebuild. It is a many-to-many copy operation. By the end of the forward rebuild operation, the system is again fully protected and optimized. No matter what, no two copies of the same chunk are allowed to reside on the same server. Clearly, this rebuild process is much lighter-weight and faster than having to serially copy an entire server to another. Note that while this operation is in progress, all the data is still accessible to applications. For the chunks of SDS1, the mirrors are still available and are used. Users experience no outage or delays. ScaleIO always reserves space on servers for failure cases, when rebuilds are going to occupy new chunk space on disks. This is a configurable parameter (i.e., how much storage capacity to allocate as reserve).
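A toy version of forward rebuild, assuming a simple chunk-to-copies map: every chunk held by the failed server is re-copied from its surviving mirror to another healthy server, so the data is fully protected again.

```python
# Toy forward rebuild: for each chunk that lost a copy, copy the surviving
# mirror to some other healthy server, never co-locating the two copies.

def forward_rebuild(placement, failed, servers):
    """placement: chunk -> (copy1, copy2). Returns a re-protected placement."""
    healthy = [s for s in servers if s != failed]
    for chunk, copies in placement.items():
        if failed in copies:
            survivor = copies[0] if copies[1] == failed else copies[1]
            # pick any healthy server other than the survivor as the new home
            new_home = next(s for s in healthy if s != survivor)
            placement[chunk] = (survivor, new_home)
    return placement

placement = {"E": ("sds1", "sds3"), "B": ("sds1", "sds6"), "F": ("sds2", "sds1")}
placement = forward_rebuild(placement, "sds1", [f"sds{i}" for i in range(1, 7)])
print(placement)   # every chunk again has two copies, none on the failed sds1
```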
Protection domains are an important feature of ScaleIO. Protection domains are sets of servers, or SDS nodes. The administrator can divide SDSs into multiple protection domains of various sizes, designating volume-to-domain assignments. As the name implies, data protection (redundancy, balancing, etc.) is established within a protection domain. This means that all the chunks of a particular volume will be stored on SDS nodes that belong to the protection domain for that specific volume. Volume data is kept within the boundaries of the protection domain. Any application on any server can access all the volumes, regardless of protection domain assignment, so an SDC can access data in any protection domain. It is important to understand that protection domains are not related to data accessibility, only to data protection and resilience. Protection domains allow for: increasing the resilience of the overall system by tolerating multiple simultaneous failures across the deployment; separating volumes for performance planning—for example, assigning highly accessed volumes to “less busy” domains or dedicating a particular domain to an application; partitioning data in multi-tenancy deployments so that tenants can be segregated efficiently and securely; and adjusting to different network constraints within the cluster.
Within a given protection domain, you can select a set of storage devices and designate them as a storage pool. You can define several storage pools within the same protection domain. When provisioning a data volume, you can assign it to one of the storage pools. Doing so means that all the chunks of this volume will be stored in devices belonging to the assigned storage pool. With protection domains and storage pools, ScaleIO establishes a strict hierarchy for volumes. A given volume belongs to one storage pool; a given storage pool belongs to one protection domain. The most common use of storage pools is to establish performance tiering. For example, within a protection domain, you can combine all the flash devices into one pool and all the disk drives into another pool. By assigning volumes, you can guarantee that frequently accessed data resides on low-latency flash devices while the less frequently accessed data resides on high-capacity HDDs. Thus, you can establish a performance tier and a capacity tier. You can divide the device population as you see fit to create any number of storage pools. The pools’ boundaries are “soft” in that the admin can move devices from pool to pool as necessary. ScaleIO responds to such shifts in pool assignments by rebalancing and re-optimizing. No user action is required to reconfigure and rebalance the system—it is automatic and fast. This ease and simplicity of movement allows the admin to introduce temporary enhancements on a whim. For instance, you can move a couple of devices from pool1 to pool2 (or from one protection domain to another) for a limited period to address an expected (or experienced) peak demand. The situation can later be reversed with no downtime or significant overhead. It’s that simple.
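The strict hierarchy described here (a volume in one storage pool, a pool in one protection domain) can be sketched as a simple data model; the domain, pool, device, and volume names below are hypothetical examples, not a real configuration.

```python
# Toy data model for the protection-domain / storage-pool / volume hierarchy,
# with pools used as performance and capacity tiers.
from dataclasses import dataclass, field

@dataclass
class StoragePool:
    name: str
    devices: list = field(default_factory=list)   # e.g. flash or HDD devices

@dataclass
class ProtectionDomain:
    name: str
    pools: dict = field(default_factory=dict)      # pool name -> StoragePool

domain = ProtectionDomain("pd1", pools={
    "performance": StoragePool("performance", ["ssd0", "ssd1"]),
    "capacity":    StoragePool("capacity", ["hdd0", "hdd1", "hdd2"]),
})

# Provisioning a volume pins all of its chunks to one pool's devices.
volume_assignments = {"oltp-db": ("pd1", "performance"),
                      "archive": ("pd1", "capacity")}
print(volume_assignments["oltp-db"])
```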
A snapshot is a volume that is a copy of another volume. Snapshots take no time to create (or remove); they are instantaneous. Snapshots do not consume much space initially because they are “thinly provisioned.” You can create snapshots of snapshots—any number of them. The slide shows a volume, V1; a snapshot of that volume, S111; and a snapshot of snapshot S111, S121. Snapshot volumes, which are fully functional, can be mapped to SDCs just like any other volumes. A complete genealogy of volumes and their snapshots is known as a VTree. Any number of VTrees can be created in the system. When you create a snapshot for several volumes (or snapshots), a consistency group that contains all the volumes in that operation is created and named. Consistency groups are automatically created when issuing a snapshot command for several volumes. Operations may be performed on an entire consistency group (for example, delete volume). Note to Presenter: Click now in Slide Show mode for animation. The slide shows two genealogies of volumes, V1 and V2. V1 is being used to create the VTree1 genealogy of snapshots; V2 is used to create the VTree2 genealogy. Note to Presenter: Click now in Slide Show mode for animation. At some point, a command is issued to create two snapshots, one of V1 and the other of V2. Because this is a single snapshot command, the two newly created snapshots are grouped: S112 (of V1) and S211 (of V2) have been grouped together in the C1 consistency group.
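A small sketch of the snapshot genealogy: each snapshot records its parent, forming a VTree, and snapshotting several volumes in one command yields a consistency group. Thin provisioning and data sharing are omitted, and the class and helper names are hypothetical.

```python
# Toy VTree and consistency-group model for snapshot genealogies.

class Volume:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []

    def snapshot(self, name):
        snap = Volume(name, parent=self)     # instantaneous: no data is copied here
        self.children.append(snap)
        return snap

def snapshot_many(volumes, group_name):
    """One snapshot command over several volumes -> one consistency group."""
    return {group_name: [v.snapshot(f"{v.name}-snap") for v in volumes]}

v1, v2 = Volume("V1"), Volume("V2")
s111 = v1.snapshot("S111")
s121 = s111.snapshot("S121")                 # a snapshot of a snapshot
groups = snapshot_many([v1, v2], "C1")       # like S112 and S211 grouped on the slide
print([s.name for s in groups["C1"]])
```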
ScaleIO features a capability known as the limiter—a configurable option to limit resource consumption by applications. The limiter provides the ability to prevent specific customers from exceeding a certain rate of IOPS and/or a certain amount of bandwidth when accessing a certain volume. On the slide, you see three applications sharing compute resources while accessing the same volume. The amounts of consumed resources are represented by the size of the colored boxes. Initially, they are all the same; the division of resources is equal. Some available compute resources exist that are not currently being consumed by the three applications. Note to Presenter: Click now in Slide Show mode for animation. Now let’s say App 3 has become “hungry” and has consumed all of the available resources. Apps 1 and 2 have no resources left to consume, should they need them. They are at risk. Note to Presenter: Click now in Slide Show mode for animation. Now App 3 is consuming so many IOPS and so much bandwidth that it is eating into the compute power that Apps 1 and 2 require. Their performance is now suffering due to App 3’s “hogging.” Note to Presenter: Click now in Slide Show mode for animation. With the ScaleIO limiter applied, however, App 3 is limited in the amount of resources it can consume, so Apps 1 and 2 can operate as their SLAs define. As you can see, the limiter allows you to allocate IOPS and bandwidth as desired in a controlled manner. Protection domains, storage pools, and the limiter allow the administrator to manage resource efficiency within the ScaleIO cluster. These tools allow the administrator to regulate and condition the system, thereby optimizing its overall operation.
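One common way to implement such a limit is a token bucket; the sketch below shows that concept for an IOPS cap and is not ScaleIO's limiter code.

```python
# Toy IOPS limiter: a token bucket refilled at the allowed rate. An I/O is
# admitted only when a token is available, so one application cannot exceed
# its configured share.
import time

class IopsLimiter:
    def __init__(self, iops_limit):
        self.rate = iops_limit
        self.tokens = float(iops_limit)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False      # caller queues or delays the I/O

app3 = IopsLimiter(iops_limit=500)            # cap the "hungry" application at 500 IOPS
admitted = sum(app3.allow() for _ in range(1000))
print(admitted)                               # roughly 500 of the 1000 attempts admitted
```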
Your data center contains lots of servers, many of which contain local disks that come standard when you purchase the servers. So from a capital perspective, you have already paid for this local storage, which is spinning continuously and consuming precious power and cooling resources. Note to Presenter: Click now in Slide Show mode for animation.But if you’re not using them for data, they’re providing no value whatsoever. Note to Presenter: Click now in Slide Show mode for animation.So why not rescue that capacity and put it to good use?