GDS International Scality Simply Scaling Storage




Scality, Simply Scaling Storage
White Paper, March 2012
Contents

Summary
1   Introduction
2   Stakes and challenges
3   What is object storage?
4   Scality RING Organic Storage
    4.1   Philosophy
    4.2   Architecture and components
    4.3   Performance
    4.4   Data Protection
    4.5   Consistency model
    4.6   Scalability
    4.7   Management and Support
    4.8   Partners and ecosystem
5   Conclusion: Scality, a new vision of storage
References
Glossary

Scality Copyright 2012. Private and Confidential.
Summary

Scality’s RING Organic Storage is a market-proven, cost-effective software solution for the storage of unstructured data at petabyte scale. Scality RING is based on patented object storage technology, which delivers high availability, ease of operations and total control of your data. It is capable of handling billions of files, without the hassle of volume management or complex backup procedures.

The organic design creates a scale-out system with distributed intelligence that has no single point of failure. As a result, RING Organic Storage is resilient and self-healing, and technology refreshes do not require any data migration or downtime. Thanks to its parallel design, Scality RING delivers very high performance for file read and write operations.

1 Introduction

Using only off-the-shelf components, Scality RING provides reliable, enterprise-grade mechanisms for data protection and continuity of service. Thanks to an intelligent mix of replication and erasure coding technologies, data durability* beyond twelve nines (99.9999999999%) can be reached.

The RING Organic Storage solution uses a distributed, decentralized and geo-redundant peer-to-peer architecture in which data is evenly spread among all participating storage servers. The system is an aggregation of independent, loosely coupled servers in a “shared nothing” model, logically unified in a “ring” to provide linear scalability, cost efficiency and data protection.

The Ring also provides exceptional fault tolerance, protecting against all types of outages (disk, server, silent data corruption, network, power and so on). It ensures high availability thanks to intelligent data placement within one site or across multiple sites.
It accomplishes this without the use of a central database, which would otherwise conflict with the Ring philosophy of eliminating all potential single points of failure.

The essence of the Ring’s performance is its massively parallel architecture, fully utilizing all (potentially heterogeneous) storage servers in order to sustain very high aggregate data transfer rates and IOPS levels. It is designed to scale linearly to thousands of storage servers.

Scality has developed a unique approach, primed for the exponential growth in storage demand. The Ring allows you to grow in step with your business needs without having to worry about costly and complex hardware refresh cycles.

* Data Durability: According to the Amazon web site, “durability (with respect to an object stored in S3) is defined as the probability that the object will remain intact and accessible after a period of one year. 100% durability would mean that there’s no possible way for the object to be lost, 90% durability would mean that there’s a 1-in-10 chance, and so forth.”
Compared to other commercial storage systems, the Ring cuts the TCO by 50%*. This is made possible by the use of commodity servers, simplified operations and a flexible pricing model.

2 Stakes and challenges

Increasing data challenges have prompted a proliferation of innovative technological responses, but until recently, these have only been point solutions. This has created islands of storage infrastructures of all types, with a mixed presence of storage arrays (SAN†), file servers (NAS‡), and backup or archive appliances. This explosion of complexity is a source of cost, due to low utilization rates, separate management integrations, hard-to-measure SLAs and the absence of upgrade paths.

Figure 1: Enterprise reality: islands of storage with a mix of vendors, models, technologies and versions

The natural response of corporate decision makers is to reduce this complexity by focusing solely on standard or market-proven solutions. Among these solutions, a move to SAN-based storage was once thought to provide the best value, in particular for high-performance requirements. However, SAN is historically based on expensive Fibre Channel connectivity; more recently, the industry started to react with an interesting evolution combining SAN and Ethernet, based on the iSCSI protocol.

* See
† SAN: Introduced in the mid-nineties, Storage Area Networks allow connections between servers and storage units via an intelligent network, historically based on Fibre Channel and more recently on SCSI over IP, aka iSCSI.
‡ NAS: Network Attached Storage is basically a file server, most of the time presented as a dedicated appliance.
For a few decades, RAID has been considered the standard mechanism for disk-based data protection. As RAID groups have become denser, it has become necessary to introduce disk groups with double parity, namely RAID6, to limit exposure to data loss. Although this protection level offers an interesting disk overhead ratio, good durability and thus a low probability of data loss, its availability and performance in case of disk loss remain the main concern, especially with large configurations.

In addition, the use of a local or network file system presents some advantages, such as the ability to allocate, name and classify information, but it also introduces some real limits. File system characteristics often impose limits such as the maximum file size, the file system size, or the number of inodes. An initial advantage can even become its opposite. For example, the search for a data item, such as a directory entry for a file, requires navigation of the file directory itself. This sequential navigation, although rapid and satisfactory for a small number of items, quickly becomes a stifling bottleneck with large data volumes.

Attempts to overcome these limitations have included the use of shared clusters with concurrent access to a file system. However, these methods have also quickly reached their limits, as the shared-disk model has been able to successfully handle only a few dozen nodes. A “shared nothing” approach (which does not share disks) does push the threshold of performance degradation beyond previously known limits, with capabilities to aggregate thousands or tens of thousands of nodes. Applied to the NAS world, one of the current market leaders offers a very scalable solution in terms of performance and data redundancy, with a capacity of several PB, but the offer remains restricted to 144 nodes due to network limitations.
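The directory-navigation bottleneck described above is easy to demonstrate: a sequential scan costs time proportional to the number of entries, while a key/value index answers in near-constant time. This is a minimal sketch of the general principle, not any vendor’s code:

```python
# Contrast sequential directory navigation (O(n) per lookup) with a
# key/value index (O(1) amortized) -- the scaling issue described above.
entries = [f"file_{i}" for i in range(1_000_000)]
index = {name: i for i, name in enumerate(entries)}

def scan_lookup(name):
    # What a naive directory walk does: inspect entries one by one.
    for i, entry in enumerate(entries):
        if entry == name:
            return i
    return None

# Both find the entry, but the scan touches ~1,000,000 items to do so.
assert scan_lookup("file_999999") == index["file_999999"] == 999999
```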
3 What is object storage?

Traditional access methods, such as file sharing or block protocols, have their own properties, advantages and limitations. Beyond SAN, NAS and scale-out NAS*, a fourth way, object storage, has been proposed by the industry.

An object is a logical entity that groups data and metadata, using the latter for management and description of the stored data. The key point involves the establishment of exchange protocols for interaction between clients and servers. There have been several such protocols proposed in the industry, but the one which seems to offer the best characteristics is the association of keys and values. This is what Scality is based on.

This approach has several simultaneous advantages:
  1. The protocol is simple to implement, which makes it reliable.
  2. Performance has a linear response, often based on calculation, and is therefore predictable, without the need for lookups or data centralization.
  3. Scalability is no longer constrained by the limits of file system or block storage.
  4. The guarantee of data-location independence and the integration of data redundancy, by coupling multiple copy mechanisms, provide robustness.
  5. Geo-distributed or stretched configurations with an Internet-like topology are feasible with object storage, but impossible to obtain with block mode or file systems.
  6. Consistency is simplified, without the need to follow the strict rules introduced by block storage or historical file system design.

The object storage concept relies on a value known from outside the system, such as a path or the content of a file, from which a unique key is calculated. The key serves to locate the data, without response times being affected by the multiplication of systems for size constraints.

Storage systems known as Content Addressable Storage (CAS) have emerged over the past decade.
These systems are key-value based object stores with the specificity that keys are directly derived (hashed) from the content. Typically, the MD5 or SHA-1 algorithms, or a combination of both, are used to uniquely identify an object.

Over the past few years, these storage systems have penetrated the market for archiving data, and several mature offerings exist. They are very efficient for such workloads, since identical objects have the same key, but they have many drawbacks when used as a general file store.

* Scale-out NAS: This is the capability to provide a logical aggregation of multiple file servers or NAS heads into the uniform file access and representation that one server can provide. The complexity of the configuration is completely hidden. This approach addresses critical and high-demanding environments both in terms of performance and availability.

More recently, it was recognized that a more general implementation of key-value based object storage, where the key is not constructed by the storage system but chosen by the
application would be much more useful for a wide variety of applications. Amazon led the way with the Amazon S3 service in 2006.

Scality has developed such a modern key-value based object store, and has added performance, making RING Organic Storage an ideal technology for file storage at scale, for any use case.

4 Scality RING Organic Storage

4.1 Philosophy

Scality, a storage infrastructure software vendor, developed the RING Organic Storage solution to address business and IT challenges associated with exponential data growth. It offers the capability to process huge numbers of transactions and store volumes of data at the petabyte scale and beyond.

Scality’s philosophy is based on strong principles that are the building blocks of a hyperscalable and high-performance infrastructure, assembled from very low cost standard components. It is about constructing a virtually infinite and elastic logical storage space based on standard commodity servers and disks.

Figure 2: Multidimensional scalability: capacity only, performance only or both
Scality targets two sets of objectives:
  1. IT efficiency:
     a. using an innovative approach to deliver availability, performance and scalability;
     b. the capability to deliver performance or capacity independently of each other, without any compromise on availability.
  2. Low TCO (producing a real and immediate ROI).

Figure 3 illustrates a perfect analogy between a traditional array with common elements (controllers and storage units) and Scality RING components such as connectors and storage servers.

Figure 3: Scality RING Organic Storage high-level architecture versus traditional storage

To fulfill these objectives, Scality RING is based on:
  • Load sharing by several elementary units functioning in parallel (the “Divide and Conquer” principle),
  • An independent and decentralized approach pioneered by peer-to-peer (P2P) networks (the “Divide and Deliver” principle),
  • The distribution of objects written to multiple nodes (the “Divide and Store” principle). Unlike the typical implementation of static distribution of data, Scality does not require any prior data sharding by the user, administrator or application. All of the splitting is implicit, transparent and integrated with the solution, without any impact on application users.
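The implicit, transparent splitting described above can be sketched as follows. The chunk size and key format here are hypothetical, purely for illustration; the point is that the application stores one object while the storage layer handles the sharding:

```python
# Hypothetical sketch of transparent object chunking: the application
# stores one object; the storage layer splits it into fixed-size chunks,
# each under a derived key, and reassembles it on read.
CHUNK_SIZE = 4  # bytes; illustrative only

def split(object_key: str, data: bytes) -> dict:
    return {f"{object_key}#{i}": data[off:off + CHUNK_SIZE]
            for i, off in enumerate(range(0, len(data), CHUNK_SIZE))}

def reassemble(object_key: str, store: dict) -> bytes:
    parts, i = [], 0
    while f"{object_key}#{i}" in store:
        parts.append(store[f"{object_key}#{i}"])
        i += 1
    return b"".join(parts)

store = split("obj42", b"abcdefghij")
assert reassemble("obj42", store) == b"abcdefghij"   # the app never sees the chunks
```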
Figure 4: Scality RING deployment example with 2 storage tiers, a DR site and multiple connectors as an access layer managed by the Supervisor

4.2 Architecture and components

From the start, Scality has based its development on the implementation of university research linked to the distribution and persistence of objects in a network of elementary systems. The ability to access any distributed object rapidly is a primary requirement.

The Chord protocol [1, 2], invented at MIT, is a second-generation P2P system leveraged by Scality to map stored objects into a virtual keyspace. Unlike “unstructured” first-generation P2P systems, such as Gnutella, in which requests are broadcast to different “producers” of storage, the second generation, known as structured P2P, relies on the effective routing of a request to the node owning the data being requested.

Scality has subsequently developed and extended Chord beyond data distribution. It has added functional components in order to reach enterprise-level performance and reliability, affording access-time reduction, object persistence guarantees and self-healing capabilities. These components cover the generation of data and server keys, their format, and the concept of logical servers, called “storage nodes”, in place of physical servers. These logical servers are instantiated as independent UNIX processes, each responsible for part of a physical machine’s address space.
With these developments Scality implements an essential ingredient for handling infrastructure load increases, and automates the redistribution of object keys from a failing server to other servers and nodes. Each storage node has an automatically assigned key and acts as a simple key/value store. Scality has developed flexible and efficient key generation and allocation mechanisms that integrate fault tolerance, the number of copies required and the topology of the Ring (racks, single or multisite). Server keys also apply the concept of “class of storage” to stay aligned with application needs.

Scality’s load balancing algorithms result in a keyspace that is uniformly distributed over the cluster of nodes present. These algorithms prevent collisions of data replicas in operational conditions, but also after possible disk or server failures. The capability to glue together different server configurations is also important; some servers could exist in the system from the beginning, alongside new machines with different capacities or performance characteristics.

Scality has implemented a dispersal technology that guarantees that all object replicas are stored on different node processes and on different physical storage servers, potentially in different datacenters. This guarantee is maintained even after a disk or server failure.

The independence of nodes, servers and keys is what enables Scality’s RING system to guarantee an availability of 99.999%, qualifying the solution as Carrier Grade and suitable for the most demanding environments.

Scality envisions a completely object-oriented philosophy to overcome traditional limits, and to support data storage on the order of several PB and more. The revolution proposed by Scality requires only the use of inexpensive, standard x86 computers as base storage servers.
This goes well beyond traditional approaches that rely on centralized indexes, catalogs or databases of small entities, such as inodes or blocks of data.

The paradigm shift is here: delivery of enterprise-quality storage with standard servers rather than the traditional approach of yet another array of hyperscalable disks. The aggregation of standard machines and their logical unification in the form of a ring conforming to a P2P philosophy offers the promise of an always-on system with unlimited scalability. It is a system capable of growing at the speed of all the demanding applications connected to the Internet.

The core of the Scality solution resides in its unique distributed architecture and intelligent self-healing software. There is no centralization of requests, no unique catalog of data, no hierarchy of systems, and therefore no notion of master or slave. The approach here is purely symmetric: all nodes have the same role and run the same code.

Scality RING is complemented by two additional systems: the Supervisor, by which the infrastructure is managed, and the connectors that link the Ring with consumers (often represented by applications or client terminals). The Supervisor is covered in section 4.7.

The connectors take on the role of data translation between a typically standard interface (either open and widely deployed, or very specialized for various business or vertical needs) and the object model as understood by the Ring. Several types of connectors exist. These can be clustered to obtain a redundant and highly performant
access layer, and can also be co-located with the application itself. The Scality RING storage infrastructure provides a scale-out architecture from both a connector and a storage standpoint.

Figure 5: Different infrastructure elements supporting Scality RING object storage

Internally, Scality software relies on two layers: the Scality Core Framework, which allows developers to use an asynchronous, event-based exchange model, and Scality RING Core, a higher layer that implements the distributed logic and manages keyspaces, self-healing and a hierarchy of any number of storage spaces.

Figure 6: The two functional layers of Scality RING in user mode

In the jargon that Scality introduces, the term “node” is different from a system or a physical server. Several nodes can run on a single server. A configuration of six nodes per server is often recommended, and their existence is purely logical, being just processes in the Unix/Linux sense of the term. They are all independent of each other even if they operate on the same server. These storage node instances control their portion of the global keyspace to locate data and honor object requests.
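The logical-node idea can be sketched numerically. Assume 4 physical servers, each running 3 node processes (numbers are illustrative; the interleaved naming is an assumption, not Scality’s assignment algorithm): spacing node keys evenly and rotating through the servers keeps consecutive node keys on different machines.

```python
# Sketch: evenly spacing logical storage nodes around the 160-bit keyspace.
# Server interleaving (A, B, C, D, A, ...) is an illustrative choice that
# keeps two consecutive node keys on different physical servers.
KEYSPACE = 2 ** 160
servers = ["A", "B", "C", "D"]
nodes_per_server = 3
total = len(servers) * nodes_per_server          # 12 logical nodes

node_keys = {}
for i in range(total):
    key = (i * KEYSPACE) // total                # even 1/12 portions
    node_keys[key] = f"{servers[i % len(servers)]}{i // len(servers)}"

# Each physical server ends up responsible for 3/12 = 1/4 of the keyspace.
ordered = [node_keys[k] for k in sorted(node_keys)]
assert ordered[:5] == ["A0", "B0", "C0", "D0", "A1"]
```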
The nodes run on the storage servers and are responsible for a portion of the global ring, with each node in charge of an even portion of the global keyspace. Figure 7 illustrates an example of a ring of 12 nodes created from four storage servers, each operating 3 storage nodes. Each server is thus responsible for 1/4 of the keyspace. The maximum supported configuration is 6 storage nodes per storage server.

Figure 7: From physical servers to storage nodes organized in a logical ring (A0…D2 are symbols to illustrate that 2 consecutive nodes don’t come from the same physical server)

Keys with a size of 20 bytes are generated by the connectors, which assign them to different nodes. This establishes a ring with a fair and balanced policy, thanks to a dispersion factor present in the key itself.

Figure 8: Format of a Scality key with 160 bits, of which 24 bits are reserved for key dispersal

Each node functions like a very simple key/value store. The key doesn’t embed location information, but the Chord algorithm always maps a key to a specific node at any given time. The internal logic of each node determines the appropriate location of object data on disk. Keys always contain either a hashed or randomly generated prefix, leading to a balanced distribution of data among all the nodes based on consistent hashing principles.

Another essential point about the Scality solution is the notion of decentralization and the independence of the nodes, since nodes do not defer to a central intelligence. As a peer-to-peer architecture, any node can receive a request for a key. The path to the right
node follows the equation ½ log2(n), with n being the number of nodes in the Ring, when the topology changes. A key is assigned to a node which has the responsibility to store the objects whose keys are immediately inferior to its own key and superior to the keys of the preceding node on the Ring.

At the heart of the same system, the In/Out daemons, known as iods, are responsible for the persistence of data on physical media. Their role is to write the data passed to the nodes on the same machine, monitor storage and ensure durability. Each iod is local to one machine, managing local storage space and communicating only with the nodes present on that same machine. There is therefore no exchange between a node of one machine and the iod of another machine.

There can be multiple iods running on the same physical machine. A typical configuration is one iod per disk. The iods are the only links between the physical residence of a data entry and the layer of services represented by the nodes and connectors.

The maximum number of iods on a server is 255, enough to support a very large load locally. Physical storage local to a server consists of regular partitions formatted with the standard ext3 file system.

Each iod controls its own file system and the data containers placed above it. These file containers are, in fact, the elementary storage units of the Ring; they receive written objects directed by the iod from requests to the nodes initiated by any connector. They store three types of information: the index to locate the object on the media, object metadata, and the object data itself. The unique connector-Ring-iod architecture provides a total hardware and network abstraction layer, with connectors at the top acting as an entry gate to the Ring, the nodes of the Ring acting as storage servers, and iod daemons as storage producers responsible for the physical I/O operations.
Figure 9: Different elements of a storage server

The Scality philosophy is all about delivering a storage infrastructure that doesn’t compromise on any of the three dimensions – performance, availability and scalability – even as they evolve over time based on end user or service provider requirements.
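Putting the pieces of this section together, here is a minimal, illustrative sketch (not Scality’s implementation) of 160-bit keys with a 24-bit dispersal prefix, and of the successor rule by which a key belongs to the node whose key is the next one at or above it on the ring:

```python
import bisect
import os

KEY_BITS = 160
DISPERSAL_BITS = 24          # leading bits, per the key format in Figure 8
PAYLOAD_BITS = KEY_BITS - DISPERSAL_BITS

def make_key(payload: int) -> int:
    """Build a 160-bit key whose first 24 bits are a random dispersal prefix."""
    dispersal = int.from_bytes(os.urandom(DISPERSAL_BITS // 8), "big")
    return (dispersal << PAYLOAD_BITS) | (payload & ((1 << PAYLOAD_BITS) - 1))

class Ring:
    """Tiny consistent-hashing ring: a key belongs to its successor node."""
    def __init__(self, node_keys):
        self.node_keys = sorted(node_keys)

    def successor(self, key: int) -> int:
        # First node key >= key, wrapping around the ring; the node stores
        # keys between its predecessor's key (exclusive) and its own (inclusive).
        i = bisect.bisect_left(self.node_keys, key)
        return self.node_keys[i % len(self.node_keys)]

# Small-number illustration matching the Figure 10 walkthrough later on:
ring = Ring([10, 15, 25, 35, 45, 85])
assert ring.successor(33) == 35      # node 35 stores keys in (25, 35]
assert ring.successor(90) == 10      # wraparound past the highest node key
```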
4.3 Performance

Performance is measured based on three criteria: latency, throughput and the number of operations per second. Latency is the time required to receive the first byte of data in answer to a READ request. It is a function of network and storage media speed, as well as of the Ring’s distributed storage protocols.

IOPS is the number of operations that can be performed per second. It can be measured as either object IOPS (READ, WRITE, DELETE of entire objects) or physical media IOPS (disk blocks per second). Associated with this is the aggregate throughput of the transmission in Gbit/s, also known as the bandwidth.

The Scality RING storage platform is designed to adapt to different, often quite variable, workloads. This can include relatively small operations (several KB) with a large number of transactions per second, or conversely, large operations (several MB) with a relatively smaller number of transactions.

The high degree of parallelism among all nodes, physical servers and connectors is a strong performance differentiator of the RING platform. The “Divide and Conquer” philosophy is implemented by spreading data among storage nodes running on several storage servers and making sure that all available resources are utilized.

To find where a file resides, legacy storage systems must go through centralized catalogs. By contrast, the Ring uses a very efficient distributed algorithm based on object keys, and shows a very linear response to increased load requests. It scales logarithmically with the number of servers.

Latency is perfectly predictable, and the time to locate data in a Ring remains appropriately small. Response times remain flat even with an increased number of servers or objects.

The second-generation peer-to-peer Chord protocol doesn’t need all nodes to know the complete list of other nodes (the peers) in the configuration.
Chord’s main quality resides in its routing efficiency.

In the original Chord protocol as documented by MIT, each node just needs knowledge of its successor and predecessor nodes as organized in the Ring. Thus, updates to data do not require the synchronization of a complete list of tables on every node, but still avoid the risk of stale information.
Figure 10: Connector lookup method based on a very efficient Chord algorithm

Using the intelligence of the Chord protocol, the Ring provides complete coverage of the allotted keyspaces.

An initial request is internally routed within the Chord ring until the right node is located. Multiple hops can occur, but the two key pieces of information – predecessors and successors – reduce the latency needed by the protocol to locate the right node. When the desired node is found, the node receiving the original request provides this information directly back to the connector.

Figure 10 illustrates a simple lookup request from the connector. Key #33 is requested by the connector, and this machine knows only keys 10 and 30. The connector picks the first piece of information and connects to that node, i.e. node 10. Nodes 15, 25, 45 and 85 are then contacted. The protocol determines that node 25, connected to node 35, matches the request for key 33. Node 25 sends the information back to node 10, and then to the connector.

The Scality RING implementation of Chord modifies the original algorithm so that:
  • Each node knows its immediate successor and all power-of-2 nodes up to the halfway point in terms of hops.
  • As a consequence, most of the time only 1 hop is needed to find the data. In the case of a topology change, the number of hops follows the equation ½ log2(n), with n being the number of nodes in the Ring. That leads to a maximum of 4 hops for a 100-node Ring, and only 5 hops maximum for a 1000-node Ring.
  • When changes occur, such as the insertion or failure of a node, a proxy mechanism is started with a balance job to maintain the Ring in an optimized topology.

Globally, the routing table rarely changes, even after an insert. The infrastructure doesn’t need to pause, stop or freeze the environment when storage servers are added.
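The hop bound quoted above can be checked directly; this is just the ½ log2(n) arithmetic, not a simulation of the protocol:

```python
import math

def max_hops(n_nodes: int) -> int:
    """Worst-case lookup hops after a topology change, per the 1/2*log2(n) rule."""
    return math.ceil(0.5 * math.log2(n_nodes))

print(max_hops(100))    # 4, as stated for a 100-node Ring
print(max_hops(1000))   # 5, as stated for a 1000-node Ring
```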
When a failure occurs, it is handled like a cache miss, and the lookup process feeds the cache line again after determining the new route to the data. Scality allows seamless topology changes as nodes join and leave the infrastructure. The Ring can continue to serve queries even while the system is continuously changing. Most of the time during normal operations, the mapping of connector-key-node is direct, and the performance is optimal.

Scality RING leverages many recent technological innovations. For example, the ability to integrate Flash-based storage or SSD units in storage servers and connectors helps to deliver ever greater performance*. Similarly, improvements can be obtained with the integration of 10 GbE or higher network cards, and the configuration of multiple ports and high-frequency multicore CPUs. The integration of new elements is possible without disturbing or unbalancing the solution, and of course, without service interruptions.

The absence of centralization removes any potential bottlenecks in one or more servers (such as might occur with a server monopolizing access information). Thus the Scality system avoids any single points of failure (SPOF) that would otherwise have the potential to disrupt the entire system. Performance extends beyond its usual definition involving speed of throughput to also include service availability. Rapid and automatic recovery from an error or breakdown is also considered by Scality to be a key measure of performance.

  Server Nodes   Software Nodes   4 kB GET Objects/sec   4 kB PUT Objects/sec
       3               36               41,573                 26,274
       4               48               51,882                 33,278
       5               60               60,410                 39,160
       …                …                   …                      …
      24              288              385,000                257,000
  Source: ESG

Figure 11: Performance results in Objects/sec

Auto-Tiering

Scality provides its own storage tiering technology, embedded within the RING, named Auto-Tiering. This method ensures the right alignment between the value of the data and the cost of the storage where the data resides. It operates at the object level; it is therefore independent of the data structure used by the application and can be applied to many different IT environments. The RING receives different criteria to apply and uses them to operate the data movement. The policy engine autonomously and automatically performs the migration and movement of objects within a single RING or between RINGs. The object key continues to reference the same location, so the move is completely seamless for the user and the application. Various configurations can be imagined, such as storage consolidation with an N-1 model, where N is the number of primary RINGs.

* See the Scality Lab Report by ESG on the performance of Scality RING when using SSD storage.
It operates at the object level and is therefore independent of the data structure used by the application, so it can be applied to many different IT environments. The RING receives the criteria governing data movement. The policy engine performs the migration and movement of objects autonomously and automatically, within a single RING or between RINGs. The object key continues to reference the same location, so the movement is completely seamless for the user and the application. Various configurations can be imagined, such as storage consolidation with an N-to-1 model, where N is the number of primary RINGs.

* See the Scality Lab Report by ESG on the performance of Scality RING when using SSD storage.
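As a rough illustration of what such a policy engine decides, the sketch below routes objects between a fast RING and a capacity RING by last-access age. The RING's real criteria and interfaces are not public; the names and the 30-day threshold are invented for this example.

```python
import time

INACTIVITY_DAYS = 30  # assumed threshold, purely illustrative

def target_ring(last_access, now=None):
    """Send recently used objects to the SSD ring and
    idle ones to the SATA capacity ring."""
    now = time.time() if now is None else now
    idle_days = (now - last_access) / 86_400
    return "capacity-ring" if idle_days > INACTIVITY_DAYS else "fast-ring"

now = time.time()
print(target_ring(now - 45 * 86_400, now))  # capacity-ring
print(target_ring(now - 1 * 86_400, now))   # fast-ring
```

A real engine would run this kind of check asynchronously over the keyspace and move the object's data while leaving its key unchanged, as described above.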
These N RINGs migrate data to a single secondary RING and share its potentially massive capacity.

This fundamental function for today's data centers allows configuring, for example, a first RING with SSDs. That RING provides fast data access and stores only 10 to 20% of the entire data volume. It is connected to a second, less frequently accessed RING with more capacity provided by SATA drives, delivering a higher but still adequate response time. Figure 4 illustrates this Auto-Tiering function between 2 RINGs within the same data center. This optional feature fully meets the financial goals of reducing the cost of storage infrastructures and optimizing the storage service.

4.4 Data Protection

Scality RING provides several mechanisms for protecting data and the infrastructure on which it operates: Replication and ARC, the new Advanced Resilience Configuration mode.

Replication

Scality offers a built-in replication mode within the Ring to provide seamless data access even in case of failures. The data is copied in native format, without any transformation, which offers a real performance gain. Scality Replication keeps multiple object copies, called replicas, across different storage nodes, with the guarantee that each replica resides on a different storage server thanks to the dispersion factor expressed in its key (the first 24 bits of each key). The mechanism developed by Scality involves projection guarantees that determine independent target nodes for additional data copies. The maximum number of replicas is 6, although typical values are 3 or 4. Additionally, an option exists to enable replication across multiple Rings, on the same or remote sites, with a flexible choice of unidirectional or multi-directional mode.

Figure 12: Replication with 2 replicas
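The dispersion idea can be sketched as follows. The actual projection function is part of Scality's patented design and is not public; this toy version simply offsets the 24-bit prefix of a 160-bit key by equal fractions of the prefix space, so that each replica lands in a distinct region of the ring (and hence on a different server).

```python
import hashlib

# Illustrative sketch only: Scality's real projection function is
# proprietary. Keys are 160 bits, with a 24-bit dispersion prefix,
# as stated in the paper; the offset scheme below is an assumption.
KEY_BITS, PREFIX_BITS = 160, 24
REST_BITS = KEY_BITS - PREFIX_BITS

def object_key(name: str) -> int:
    # SHA-1 conveniently yields a 160-bit value
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def replica_keys(key: int, replicas: int = 3):
    """Derive one key per replica: same payload bits, prefix shifted
    by an equal fraction of the 24-bit prefix space."""
    prefix = key >> REST_BITS
    rest = key & ((1 << REST_BITS) - 1)
    step = (1 << PREFIX_BITS) // replicas
    for i in range(replicas):
        p = (prefix + i * step) % (1 << PREFIX_BITS)
        yield (p << REST_BITS) | rest

k = object_key("invoice-2012.pdf")
for r in replica_keys(k):
    print(f"{r >> REST_BITS:06x}")  # three distinct 24-bit prefixes
```

Because the prefixes are spread a third of the prefix space apart, no two replicas can hash into the same ring region.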
Advanced Resilience Configuration

In addition to replication, Scality developed Advanced Resilience Configuration, known as ARC, its erasure-code technology based on the IDA (Information Dispersal Algorithm), well known and long proven in telecommunications. Its reconstruction mechanism is based on Reed-Solomon [3] error-correcting theory. ARC is a new optional feature running within the RING to protect data intelligently against disk, server, rack or site failures. This configuration mode reduces the number of copies and avoids storing two or three copies of the same information. This mechanism therefore significantly cuts hardware CapEx as well as the related OpEx.

As a brief description of Scality ARC, let's consider n objects which need to be stored and protected. For simplicity, we will assume that they are all 1 MB in size, and that we want to protect against k failures (which we note ARC(n,k)). Scality ARC stores each of the n pieces of content individually, and in addition stores k new objects called checksums. These checksums are mathematical combinations of the original n objects, constructed in such a way that all n objects can be reconstructed despite the loss of any k elements, whether objects or checksums. With Scality ARC, each of the k checksums is 1 MB in size, giving protection against the loss of k disks or servers with just an extra k MB of storage.

To illustrate the benefits and the mechanism behind this theory, consider the following example:

Figure 13: Scality RING Advanced Resilience Configuration with the (16,4) model

Here, with just 4 MB of additional storage, we have protected 16 MB against the loss of 4 disks or servers. That is an overhead of 4/16 = 25%, much lower than the 200% overhead of replication, and for much better protection than RAID 6.
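The simplest instance of this checksum idea is k = 1 with a plain XOR parity; ARC generalizes it to arbitrary k using Reed-Solomon codes. The sketch below shows only that k = 1 special case: one XOR checksum over n fragments lets any single lost fragment be rebuilt.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

# n = 3 data fragments plus k = 1 XOR checksum
fragments = [b"\x01\x02", b"\x0a\x0b", b"\x10\x20"]
checksum = reduce(xor, fragments)

# lose fragment 1; rebuild it from the survivors and the checksum
rebuilt = reduce(xor, [fragments[0], fragments[2], checksum])
print(rebuilt == fragments[1])  # True
```

With k > 1, XOR alone is not enough; Reed-Solomon coefficients make the k checksums independent combinations so that any k losses remain recoverable.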
Note that erasure code protects against server loss as well as disk loss, which is not the case with RAID.

Traditional implementations of erasure-code technology for storage introduce a "penalty on read", meaning that the reader must fetch several fragments and then extract the data to recover the original information. This is the drawback of the dispersed storage approach: it introduces 200 to 300 ms of latency to serve the data. To avoid that overhead and the large number of IOPS, Scality chooses to store the original data fragments and the checksum fragments independently. Scality implements a (16,4) model, referred to as ARC(16,4), meaning that redundancy adds 25% to the data space. If we compare Replication and ARC in terms of cost-effectiveness for the same level of redundancy, Replication with 3 copies needs 2x extra storage space, while ARC(16,4) needs just 25% more space. That represents 8 times less storage, or just 12.5% of the replication case, for the storage dedicated to protection. Globally, for data plus protection space, the ARC implementation represents 2.4 times less storage than the total configuration with replication. This is what Scality has implemented and recommends for large-scale configurations. It is now possible to get 1,000,000 times better reliability than RAID 6, with no more disk overhead and no performance bottleneck. In comparison with dispersed storage, the Scality approach avoids the penalty on read and continues to offer the best response time, with a direct read operation on the data.

Figure 14: Scality RING ARC example vs. Replication and Dispersed approaches

We summarize some results in the following table for a 1 PB storage configuration with RAID 5, RAID 6, Replication and Scality ARC. RAID 5, with 5 data disks and 1 parity disk, demonstrates a very good storage overhead but a poor configuration for data durability. RAID 6, with 6 data disks and 2 parity disks, presents a big gain with interesting durability but a very limited tolerance of disk failures. This limitation demonstrates that RAID 6 is not the right choice for large configurations, especially ones beyond 1 PB. Replication is a very reliable solution, but the cost represented by the storage overhead can be a drawback for certain accounts and configurations. Finally, Scality ARC with a (16,4) configuration adds just 25% of storage overhead, an exceptionally low risk of data loss, and high durability, associated with the capability to tolerate multiple storage node failures.
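The quoted ratios follow directly from the overheads; a quick check for 1 PB of usable data:

```python
DATA_PB = 1.0

replication_total = DATA_PB * 3           # data plus two full copies
arc_total = DATA_PB * (1 + 4 / 16)        # data plus 25% checksum space

# protection-only storage: 2 PB vs 0.25 PB -> 8x less with ARC
print((replication_total - DATA_PB) / (arc_total - DATA_PB))  # 8.0
# total storage: 3 PB vs 1.25 PB -> 2.4x less with ARC
print(replication_total / arc_total)                          # 2.4
```

Both results match the "8 times less" protection storage and "2.4 times less" total storage figures stated above.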
Figure 15: Scality ARC comparison with RAID and Replication configurations

4.5 Consistency model

Scality allows tuning along the three dimensions of Brewer's CAP theorem [13] (consistency, availability and partition tolerance). Scality RING can behave in "Strong Consistency" (SC) or "Eventual Consistency" (EC) mode. SC potentially reduces performance, while EC provides better response time but softer consistency, depending on the configured environment and operating constraints. Scality RING enforces ACID (Atomicity, Consistency, Isolation and Durability) transactions, although administrators can decide to relax some constraints based on their requirements. The goal is that a transaction leaves the Ring in a known, stable state without breaking integrity constraints.

R + W > N: Strong Consistency
R + W <= N: Eventual Consistency

R: number of replicas contacted to satisfy a read operation
W: number of replicas that must send an acknowledgment (ACK) to satisfy the write operation
N: number of storage nodes storing the replicas of the requested data

Particular cases:
• W >= 1: writes are always possible
• R >= 2: minimal redundancy

Figure 16: Illustration of Brewer's theorem

In theory, data consistency with multiple copies can be enforced in the write (write()) or read (read()) operation. The traditional approach in the industry is to include the coherency in the typically synchronous write function, with no real control in the read function. All data are identical thanks to the constraint in the write operation, so read operations are equivalent and deliver the same value.

A second approach makes writes fast and always possible, because the consistency check is placed in the read. However, for highly distributed systems or very large
environments, partitioning can exist with a very asynchronous write mode. To meet the needs of read operations, and to move towards the harmonization of the different versions of data inherent in a distributed system, Scality extends the Merkle [13-17] tree algorithm to identify different versions and align them with the required number of copies. The I/O engine always wants to access and see a consistent version of an object before writing to it.

Figure 17 illustrates the write() operation after key allocation, with the native ability to execute in parallel mode.

Figure 17: Schema for a write operation from the connector

4.6 Scalability

Scality delivers an enterprise storage infrastructure capable of handling large volumes of data. Scality Ring provides a solution that is highly scalable, offers high performance, and is available at a reasonable and affordable cost in comparison with traditional storage rates.

To achieve such results, Scality uses its own virtual storage technology of distributed objects over standard, low-cost servers. These include standard servers equipped with x86 processors, SSD, SAS or SATA disks, and multi-Gb Ethernet ports running on Linux. The result is a storage farm providing a cost-effective solution for any given performance, capacity and feature set.

Scality delivers capacity rates 60% cheaper than the already attractive prices offered by Amazon S3. A recent Scality TCO study puts the storage rate for 1 PB at only 5 cents/GB/month. The TCO of storage with Scality declines further as storage volume grows.

It is therefore easier and less expensive to add capacity by simply adding storage components to a Scality RING. A very high level of storage service can be maintained simply by adding simple servers that function as storage media or connectors, or by joining several instances of Scality RING. The beauty of the architecture comes from its scalability.
The greater the number of servers, the better the global performance and availability, and the cheaper and more efficient access becomes.
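Using the ESG figures quoted in Figure 11 above, one can check that per-server throughput holds up as the Ring grows, which is what linear scalability means in practice (the numbers below are copied from that table):

```python
# 4 kB GET rates measured by ESG (servers -> objects/sec)
gets = {3: 41_573, 4: 51_882, 5: 60_410, 24: 385_000}

for servers, rate in sorted(gets.items()):
    print(f"{servers:>2} servers: {rate / servers:>8,.0f} GETs/sec per server")
```

The per-server rate stays in the same band from 3 servers up to 24, so aggregate throughput grows roughly in proportion to the server count.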
Scality's approach was recently studied and validated by ESG (Enterprise Strategy Group), an independent consulting and business strategy firm. ESG conducted a deep analysis and tests on manageability, flexibility, resiliency and object-based performance. The following charts illustrate the scalability capabilities of Scality RING. For more information, the report is available on the Scality web site.

Figure 18: Performance Scalability in Objects/sec units

Figure 19: Performance Scalability in Aggregate Throughput (MB/s)
4.7 Management and Support

Scality is easy to configure and manage with a command line interface (CLI) and a Web GUI. The CLI, named RingSH, supports self-created scripts and can be integrated in a management framework as well. The second administration element, named the Supervisor, is very intuitive and offers many additional capabilities, such as monitoring from the Ring down to the individual disk, status of storage servers and nodes, capacity statistics, and alerts. The Supervisor allows a deep dive into the details of storage nodes by key or server, and provides the management capability to add or remove servers when needed, whether for a hardware refresh with new technologies or a maintenance task to replace defunct servers. During all these steps, the Ring continues to serve requests without any impact and redistributes data among the remaining online servers and resources. This behavior demonstrates and reinforces the elastic character of Scality RING.

Figure 20: Scality RING dashboard with storage nodes and key space distribution

Scality provides 24x7 support and maintenance and delivers a variety of professional services, such as on-site experts or a dedicated care service.
4.8 Partners and ecosystem

The target markets for Scality RING are service providers and large enterprises experiencing relentless demand for performance, availability and scalability. Companies wanting to establish their own private or hybrid cloud are a perfect use case for the Scality RING solution.

Scality's adoption in these markets is enabled by the availability of connectors that are customizable to specific requirements. Connectors provide the links and the translations between applications and the Ring. The Ring integrates sophisticated functions that allow the development of application services at a very high level.

Connectors can also be developed independently of Scality, thanks to an available API, or they can be the object of cooperative development between the partner, the application users and Scality. The Scality Open Source Program (SCOP) is dedicated to this type of partnership.

Several types of connectors are available, from the generic to the specialized:

Generic (for standard or basic IT services):
• REST/http
• File System (FUSE)
• NAS (NFS)
• CDMI (SNIA standard)

IT (for advanced IT services):
• Email: Zimbra, Open-Xchange, Openwave, Critical Path, Dovecot, Cyrus
• REST Storage Service (RS2)
• Backup: CommVault
• ECM: Nuxeo
• Enterprise NAS: Gluster, Panzura, CTera Networks
• Cloud Gateway: TwinStrata, StorSimple
• Virtual Computing: Parallels
• Cloud Desktop/Agent: Gladinet, Mezeo, TeamDrive, CloudBerry Lab, OxygenCloud

Figure 21: Scality list of connectors and associated partners
5 Conclusion: Scality, a new vision of storage

Scality proposes a new way of consuming storage without any inherent limits, all in a very economical fashion. The solution offers enterprise-class features that overcome the costs and constraints of traditional storage mechanisms. Whatever the application or usage, Scality RING Organic Storage can deliver a custom-aligned interface to store and exchange data in block, file or native object mode.

The Scality solution is the right answer for demanding applications with high IOPS or guaranteed throughput requirements. Scality protects IT investments no matter what software and hardware are already in place. Best of all, Scality can evolve to take advantage of the latest advances in server technology and consumer pricing value.

Internet companies faced these challenges a decade ago, developed their own solutions, and reached a leadership position that was not feasible with the commercial offerings of the time. Today, you can deploy, use and rely on a similar storage datacenter running the Scality RING solution, and differentiate your business from the competition with unprecedented agility and a very cost-effective approach.

Scality's technology is available today, and represents the storage infrastructure of the future. The business value of Scality RING depends as much on its advantages in scalability, performance and availability as on its affordability.
References

1. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan. MIT Laboratory for Computer Science.
2. Multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capability. US and WIPO patent WO/2010/080533. Vianney Rancurel, Oliver Lemarie, Giorgio Regni, Alain Tauch, Benoit Artuso, Jonathan Gramain.
3. Probabilistic offload engine for distributed hierarchical object storage. US and WIPO patent WO/2011/072178. Giorgio Regni, Jonathan Gramain, Vianney Rancurel, Benoit Artuso, Bertrand Demiddelaer, Alain Tauch.
4. On Routing in Distributed Hash Tables. Fabius Klemm, Sarunas Girdzijauskas, Jean-Yves Le Boudec, Karl Aberer. School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
5. Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing. Fabius Klemm, Jean-Yves Le Boudec, Dejan Kostić, Karl Aberer. School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland.
6. An Architecture for Peer-to-Peer Information Retrieval. Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu. School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland.
7. A High-Performance Distributed Hash Table for Peer-to-Peer Information Retrieval. Thèse #4012 (2008), Fabius Klemm, EPFL.
8. Dynamo: Amazon's Highly Available Key-value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels.
9. Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Google.
10. The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. Google.
11. Computing in the RAIN: A Reliable Array of Independent Nodes. Vasken Bohossian, Charles C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, Jehoshua Bruck. California Institute of Technology.
12. Time, Clocks, and the Ordering of Events in Distributed Systems. L. Lamport. Comm. ACM 21, 1978, pp. 558-565.
13. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. Seth Gilbert, Nancy Lynch.
14. Providing Authentication and Integrity in Outsourced Databases using Merkle Hash Trees. Einar Mykletun, Maithili Narasimha, Gene Tsudik. University of California, Irvine.
15. Secrecy, authentication, and public key systems. R. Merkle. Ph.D. dissertation, Dept. of Electrical Engineering, Stanford University, 1979.
16. Fractal Merkle Tree Representation and Traversal. M. Jakobsson, T. Leighton, S. Micali, M. Szydlo. RSA-CT '03.
17. Implementation of a Hash-based Digital Signature Scheme using Fractal Merkle Tree Representation. D. Coluccio.
18. Merkle Tree Traversal in Log Space and Time. M. Szydlo. Eurocrypt '04.
19. An Analysis of Latent Sector Errors in Disk Drives. Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, Jiri Schindler.
20. Failure Trends in a Large Disk Drive Population. Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso. Google.
21. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? Bianca Schroeder, Garth A. Gibson.
Glossary

Features and Benefits

Object Storage
The Scality RING Organic Storage is consumed through an object interface, independent of classic file and block protocols. This object layer enables better scalability and independence from application and hardware components. All the logic is controlled by the software layer developed by Scality. For integration in legacy environments, Scality RING also exposes file and block interfaces through its connectors.

2nd generation P2P
Scality employs a logically distributed and structured organization based on DHT* without any central intelligence, catalog or hierarchy across servers. It provides an SLA guarantee with linear and near-direct key-node mappings. Scality delivers comprehensive keyspace coverage through consistent hashing based on the Chord protocol.

Key/Value Store
Scality offers linear performance at scale with its key/value approach, which is simple by design but powerful in its results. The generated key, with its 160 bits, carries implicit meaning for object placement, redundancy and service levels.

No data sharding
There is no need to segment or split the data to use Scality RING. The immediate gain is simplicity, flexibility and durability of data during its entire lifetime.

Application integration
Applications access their data through connectors that translate the application data representation to the Scality internal model. There are multiple generic connectors, including http/REST, RS2, and file- or block-based accessors. More sophisticated connectors have been developed for mail system integration with Zimbra, Openwave, Critical Path, Dovecot, Open-Xchange and Cyrus. Additional connectors are available for:
• Mezeo, CloudBerry Lab, Gladinet, OxygenCloud, TeamDrive (Cloud)
• Parallels (Virtualization)
• Gluster/RedHat, Panzura, CTera Networks (NAS)
• TwinStrata, StorSimple (iSCSI Storage Gateway)
• CommVault (Backup and Archive)
• Nuxeo (ECM)
• SNIA CDMI** support (server and client)
Scality has a partner program, SCOP, to enable fast connector development via the Scality open API.

Object and entity number and size
There is no maximum number or size limitation for the objects, files, file systems, databases or tables stored. The object key size is 20 bytes, with an identification key of 128 bits. Objects can also have multiple sizes during their lifetimes.

Self-Healing
Scality RING implements a comprehensive, fully automated mechanism to detect and fix failures at the object, storage or node level to maintain the SLA. The integrity of each object is protected by a checksum method used to compute and generate new copies.

Auto-Tiering
Scality allows object hierarchization within a RING or across multiple RINGs. This enables better SLA alignment, with migration of inactive data to the secondary level and the presence of active data on primary storage.

Load Balancing
Scality fully automates load balancing of the keyspaces within the storage infrastructure (across storage nodes). The load balancing makes sure that objects and metadata are evenly distributed regardless of root event (loss or addition of nodes, configuration changes, etc.) or failure.

Data Redundancy
Up to 5 replicas (6 copies) per object can be set. This number can differ among stored objects. All configuration parameters are controlled in the administration console running on the supervisor node. The copies are managed by the connectors themselves.

Access Redundancy
Multiple connectors in stateless mode are configured to maintain access to the data infrastructure. Multiple copies and parallel access to different storage nodes provide continuous service to applications.

Advanced Resilience Configuration (ARC)
ARC is the erasure-code implementation made by Scality to reduce the number of data copies and increase the resiliency of the data globally. The default configuration is (16,4), meaning 16 data fragments are stored plus 4 checksum fragments. It provides direct data access without additional computation and tolerates up to 4 component failures per request (disks, servers, networks…).

Management
The control and administration of the platform can be managed from the supervisor node running the dashboard GUI. It is also possible to operate a command line interface (CLI) via RingSH, or to run a script with all desired options.

Hardware agnostic
Scality RING allows a completely independent selection of hardware, without any constraints on CPU, memory, network or disk type. Scality leverages commodity components and runs on standard Linux distributions (CentOS, Ubuntu and Debian). Recommendations and guidelines are available to establish the right server farm for specific applications.

No limit to hardware
There is no limit in terms of number or mixture of servers. Pick your own brand and model, then build your storage platform.

Node flavor
There are three kinds of logical systems within a Scality infrastructure:
• Connector node: gateway to access the RING and transport data in or out; very important in the key generation process and for object redundancy mechanisms.
• Supervisor node: specific role in the management and configuration of the platform.
• Storage node: virtual storage server running on a physical storage server. Multiple instances run at the same time to enable data dispersion, parallel access and redundancy.

*DHT: Distributed Hash Table
**CDMI: SNIA Cloud Data Management Interface
***RAIN: Reliable | Random | Redundant Array of Inexpensive | Independent Nodes