This document discusses experiments conducted to determine the optimal hardware and software configurations for building a cost-efficient Swift object storage cluster with expected performance. It describes testing different configurations for proxy and storage nodes under small and large object upload workloads. The results show that for small object uploads, high-CPU instances performed best for storage nodes while either high-CPU or high-end instances worked well for proxies. For large object uploads, large instances were most cost-effective for storage nodes and high-end instances remained suitable for proxies. The findings provide guidance on right-sizing hardware based on workload characteristics.
1. How swift is your Swift?
Ning Zhang, OpenStack Engineer at Zmanda
Chander Kant, CEO at Zmanda
2. Outline
Build a cost-efficient Swift cluster with expected performance
Background & Problem
Solution
Experiments
When something goes wrong in a Swift cluster
Two Types of Failures: Hard Drive, Entire Node
What is the performance degradation before the failures are fixed?
How soon will the data be back (once all failed nodes are back on-line)?
Experiments
3. Zmanda
Leader in Open Source Backup and Cloud Backup
We have seen strong interest in integrating our cloud backup
products with OpenStack Swift
Backup to OpenStack Swift
Alternative to tape based backups
Swift Installation and Configuration Services
4. Background
Public Storage Cloud
Pros: pay-as-you-go, low upfront cost …
Cons: expensive in the long run, performance is not clear …
Private Storage Cloud (use case: backing up data to a private cloud with Zmanda products)
Pros: low TCO in the long run, expected performance, in-house data …
Cons: high upfront cost, long ramp-up period (prepare and tune HW & SW)
Open Problem / Challenge:
How to build private cloud storage with low upfront cost, expected performance, and a short ramp-up period?
5. Background
Swift is an open-source object store running on commodity HW
High scalability (linear scale-out as needed)
High availability (3 copies of data)
High durability
Swift has heterogeneous types of nodes
Proxy – Swift’s brain (coordinates requests, handles failures…)
Storage – Swift’s warehouse (stores objects)
6. Problem
How to provision the proxy and storage nodes in a Swift cluster for expected performance (SLA) while keeping upfront cost low?
Hardware: CPU, memory, network, I/O devices …
Software: filesystem, Swift configuration …
……
7. Lesson Learnt from Past
Proxy node: CPU and network I/O intensive → high-end CPU, 10 GE networking
Storage node: disk I/O intensive, XFS filesystem → commodity CPU, 1 GE networking
Are these always true, especially for different workloads?
Does it always pay off to choose 10 GE (expensive!) for proxy nodes?
Is it always sufficient to use commodity CPU for storage nodes?
Are storage nodes always disk I/O intensive?
How much performance difference is there between XFS and other filesystems?
….
8. Solution
Solution (similar to a “Divide and Conquer” strategy)
First, solve the problem in a small Swift cluster (e.g. 2 proxy nodes, 5-15 storage nodes):
1: For each HW configuration for the proxy node
2: For each HW configuration for the storage node
3: For each number of storage nodes (5, 10, 15, …)
4: For each SW parameter setting
5: Build the small Swift cluster, measure its performance, calculate and save its “performance/cost”
6: Recommend the small Swift clusters with the highest “performance/cost”
Exhaustive search? Pruning methods make it simple! (See the sketch below.)
Then, scale out the recommended small Swift clusters to large Swift clusters until the SLA is met
Performance and cost also scale (when networking is not a bottleneck)
The HW & SW settings identified in the small clusters also hold for the large clusters
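To make the search concrete, here is a minimal sketch (not the authors' actual tooling) of the nested loop with a pruning hook and the performance/cost ranking. Prices come from the EC2 table on the next slide; measure_throughput is a hypothetical callback that would deploy the cluster and drive the workload (e.g. with COSBench).

    # Minimal sketch of the pruned configuration search: enumerate small-cluster
    # configurations, benchmark each one, and rank by performance/cost.
    HOURLY_PRICE = {"Cluster": 1.30, "High-CPU": 0.66, "Large": 0.32}  # $/h, US East

    PROXY_HW = ["Cluster", "High-CPU"]
    STORAGE_HW = ["High-CPU", "Large"]
    STORAGE_COUNTS = [5, 10, 15]
    SW_SETTINGS = [
        {"filesystem": "XFS", "db_preallocation": "on"},
        {"filesystem": "ext4", "db_preallocation": "off"},
    ]

    def cluster_cost(proxy_hw, storage_hw, n_storage, n_proxy=2):
        return n_proxy * HOURLY_PRICE[proxy_hw] + n_storage * HOURLY_PRICE[storage_hw]

    def recommend(measure_throughput, prune=None, top_k=3):
        """Search all small-cluster configurations; return the top_k by perf/cost."""
        results = []
        for proxy_hw in PROXY_HW:                      # step 1
            for storage_hw in STORAGE_HW:              # step 2
                for n_storage in STORAGE_COUNTS:       # step 3
                    for sw in SW_SETTINGS:             # step 4
                        config = (proxy_hw, storage_hw, n_storage, tuple(sw.items()))
                        if prune and prune(config):    # skip configurations known to be dominated
                            continue
                        ops_per_sec = measure_throughput(config)   # step 5: deploy + benchmark
                        perf_per_dollar = ops_per_sec / cluster_cost(proxy_hw, storage_hw, n_storage)
                        results.append((perf_per_dollar, config))
        results.sort(key=lambda r: r[0], reverse=True)
        return results[:top_k]                         # step 6: recommend the best clusters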
9. Evaluation - Hardware
Hardware configuration for proxy and storage nodes
Amazon EC2 (diverse HW resources, no upfront cost; virtualized HW as a proxy for physical HW)
Two hardware choices for proxy node:
# 1: Cluster Compute Extra Large Instance (EC2 Cluster)
# 2: High-CPU Extra Large Instance (EC2 High-CPU)
Two hardware choices for storage node:
# 1: High-CPU
# 2: Large Instance (EC2 Large)
                     Cluster                 High-CPU                Large
CPU speed            33.5 EC2 Compute Units  20 EC2 Compute Units    4 EC2 Compute Units
Memory               23 GB                   7 GB                    7.5 GB
Network              10 GE                   1 GE                    1 GE
Pricing (US East)    $1.30/h                 $0.66/h                 $0.32/h
10. Evaluation – Cost & Software
Upfront Cost
EC2 cost ($/hour)
EC2 cost ≠ physical HW cost, but it is a good indicator of physical HW cost
Software Configuration
Filesystem
XFS (recommended by RackSpace)
Ext4 (popular FS, but not evaluated for Swift)
Swift Configuration Files
db_preallocation (it is suggested to set it to on for HDDs to reduce fragmentation)
OS settings
disable TIME_WAIT, disable syn cookies …
we will discuss these in a future blog post …
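For illustration only, the settings mentioned above might look roughly like the following on a storage node (file paths, device names, and values are assumptions for a typical Swift install, not the exact configuration used in these tests):

    # /etc/swift/container-server.conf (likewise account-server.conf)
    [DEFAULT]
    db_preallocation = on          # preallocate container/account DB space to reduce fragmentation on HDDs

    # /etc/fstab entry for an XFS-formatted data disk
    /dev/sdb1  /srv/node/sdb1  xfs  noatime,nodiratime,logbufs=8  0 0

    # /etc/sysctl.conf – OS network tuning
    net.ipv4.tcp_syncookies = 0    # disable SYN cookies
    net.ipv4.tcp_tw_reuse = 1      # reduce sockets lingering in TIME_WAIT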
11. Evaluation – Workloads
2 Sample Workloads
Upload mix: PUT 90%, GET 5%, DELETE 5%
Small Objects (object size 1 KB – 100 KB). Example: an online gaming hosting service where game sessions are periodically saved as small files.
Large Objects (object size 1 MB – 10 MB). Example: enterprise backup, where files are compressed into large chunks before upload; occasionally, recovery and delete operations are needed.
Object sizes are randomly and uniformly chosen within the pre-defined range
Objects are continuously uploaded to the test Swift clusters
COSBench – a cloud storage benchmark tool from Intel
You are free to define your own workloads in COSBench! (A simplified stand-in for such a workload is sketched below.)
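Purely as an illustration of the small-object upload mix (90% PUT, 5% GET, 5% DELETE, sizes uniform in 1 KB – 100 KB), a stand-in client might look like this in Python; the deck's actual tests were driven by COSBench, and the auth URL, credentials, and container name below are placeholders.

    import os
    import random
    from swiftclient.client import Connection

    # Placeholder endpoint and credentials for a test Swift cluster
    conn = Connection(authurl="http://proxy.example.com:8080/auth/v1.0",
                      user="test:tester", key="testing")
    conn.put_container("bench")

    uploaded = []
    for i in range(10000):
        op = random.choices(["PUT", "GET", "DEL"], weights=[90, 5, 5])[0]
        if op == "PUT" or not uploaded:
            name = "obj-%d" % i
            size = random.randint(1 * 1024, 100 * 1024)       # 1 KB - 100 KB, uniform
            conn.put_object("bench", name, contents=os.urandom(size))
            uploaded.append(name)
        elif op == "GET":
            conn.get_object("bench", random.choice(uploaded))
        else:  # DEL
            conn.delete_object("bench", uploaded.pop())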
12. Evaluation – Upload small objects
Top-3 recommended hardware for a small Swift cluster
                         HW for proxy node           HW for storage node           Throughput/$
Upload Small Objects  1  2 proxy nodes (High-CPU)    5 storage nodes (High-CPU)    151
                      2  2 proxy nodes (Cluster)     10 storage nodes (High-CPU)   135
                      3  2 proxy nodes (Cluster)     5 storage nodes (High-CPU)    123
All storage nodes are based on High-CPU
CPU is used intensively to handle the large number of requests; CPU is the key resource.
Compared to the Large Instance (4 EC2 Compute Units, $0.32/h),
the High-CPU Instance has 20 EC2 Compute Units at $0.66/h (5X the CPU resources, only 2X the price)
Proxy node
Traffic pattern: high request throughput, low network bandwidth (e.g. 1250 op/s at a ~50 KB average object size ≈ 61 MB/s)
10 GE on the Cluster Instance is over-provisioned for this traffic pattern
Compared to High-CPU, Cluster has 1.67X the CPU resources but is 2X as expensive
Besides, 5 storage nodes can almost saturate 2 proxy nodes
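As a rough sanity check (assuming Throughput/$ here means operations per second divided by the cluster's hourly EC2 cost), the #1 configuration costs 2 × $0.66 + 5 × $0.66 = $4.62/h, so a score of 151 corresponds to roughly 151 × 4.62 ≈ 700 op/s.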
13. Evaluation – Upload large objects
Top-3 recommended hardware for a small Swift cluster
                         HW for proxy node           HW for storage node        Throughput/$
Upload Large Objects  1  2 proxy nodes (Cluster)     10 storage nodes (Large)   5.6
                      2  2 proxy nodes (High-CPU)    5 storage nodes (Large)    4.9
                      3  2 proxy nodes (Cluster)     5 storage nodes (Large)    4.7
All storage nodes are based on Large
More time is spent transferring objects to the I/O devices. The write request rate is low, so CPU is not the key factor.
Compared to the High-CPU Instance (20 EC2 Compute Units, $0.66/h),
the Large Instance has 4 EC2 Compute Units (sufficient) at $0.32/h (2X cheaper).
Proxy node
Traffic pattern: low request throughput, high network bandwidth
e.g. 32 op/s -> 160 MB/s of incoming traffic and ~500 MB/s of outgoing traffic (writes go out in triplicate!)
1 GE on High-CPU is under-provisioned; 10 GE on Cluster pays off for this workload.
10 storage nodes are needed to keep up with the 2 proxy nodes (10 GE)
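A rough check of why 10 GE pays off here (assuming an average object size of ~5 MB): 32 op/s × 5 MB ≈ 160 MB/s of incoming traffic, and with 3 replicas written, roughly 3 × 160 ≈ 480-500 MB/s flows out to the storage nodes; even split across 2 proxy nodes, that is well beyond what a 1 GE link (~125 MB/s) can carry.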
14. Evaluation – Conclusion for HW
Take-away points for provisioning HW for a Swift cluster
                       Hardware for proxy node    Hardware for storage node
Upload Small Objects   1 GE, high-end CPU         1 GE, high-end CPU
Upload Large Objects   10 GE, high-end CPU        1 GE, commodity CPU
Download workloads: see the backup slides
Contrary to the lessons learnt from the past
It does NOT always pay off to choose 10 GE (expensive!) for proxy nodes
It is NOT always sufficient to use commodity CPU for storage nodes
Upload is disk I/O intensive (3 copies of data are written),
but download is NOT always disk I/O intensive (only one copy of the data is retrieved)
15. Evaluation – Conclusion for SW
Take-away points for provisioning SW for a Swift cluster
                       db_preallocation    Filesystem
Upload Small Objects   on                  XFS
Upload Large Objects   on / off            XFS / Ext4
Upload Small Objects (more sensitive to software settings)
db_preallocation: intensive updates on the container DB; setting it to on gains 10-20% better performance
Filesystem: we observe that XFS achieves 15-20% better performance than Ext4
16. Evaluation – Scale out small cluster
Workload #1: upload small objects (same workload for exploring HW & SW configurations for small Swift cluster)
Based on the top-3 recommended small Swift clusters
[Chart: Upfront Cost ($/hour) vs. Response Time (ms) of 80% of requests, for the top-3 recommended small clusters scaled out by 1X, 1.5X, 2X and 3X.
2 Proxy Nodes (High-CPU) + 5 Storage Nodes (High-CPU): 1X (2 proxy / 5 storage), 2X (4 proxy / 10 storage), 3X (6 proxy / 15 storage)
2 Proxy Nodes (Cluster) + 10 Storage Nodes (High-CPU): 1X, 1.5X
2 Proxy Nodes (Cluster) + 5 Storage Nodes (High-CPU): 1X, 2X
SLA 1: 80% of requests < 600 ms; SLA 2: 80% of requests < 1000 ms. The lowest-cost scaled-out cluster meeting each SLA is marked.]
17. Outline
Build a cost-efficient Swift cluster with expected performance
Background & Problem
Solution
Experiments
When something goes wrong in a Swift cluster
Two Types of Failures: Hard Drive, Entire Node
What is the performance degradation before the failures are fixed?
How soon will the data be back (once all failed nodes are back on-line)?
Experiments
18. Why Consider Failures
Failure stats in Google’s DC (from Google Fellow Jeff Dean’s 2008 interview)
In a cluster of 1,800 servers, in its first year……
In total, 1,000 servers failed and thousands of HDDs failed
1 power distribution unit failed, bringing down 500 – 1,000 machines for 6 hours
20 racks failed, each time causing 40 – 80 machines to vanish from the network
Failures in Swift
Given a 5-zone setup, Swift can tolerate at most 2 zones failed (data will not be lost)
But, performance will degrade to some extent before the failed zones are fixed.
If Swift operators want to ensure a certain performance level,
they need to benchmark the performance of their Swift clusters upfront
19. How Complex to Consider Failure
(1) Possible failure at one node
Disk
Swift process (rsync is still working)
Entire node
(2) Which type of node failed
Proxy
Storage
(3) How many nodes failed at same time
Combining the above three considerations, the total space of failure scenarios is huge
It is only practical to prioritize those failure scenarios
E.g. the worst or most common scenarios are considered first (see the toy enumeration below)
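A toy enumeration shows how quickly the space grows even before combining failures across node classes; the node counts match the small test cluster used later in this deck, and "swift-process" stands for the case where only Swift processes die while rsync keeps running.

    from itertools import product

    FAILURE_TYPES = ["disk", "swift-process", "entire-node"]
    NODE_COUNTS = {"proxy": 2, "storage": 10}

    scenarios = [
        (failure, node_class, k)
        for failure, node_class in product(FAILURE_TYPES, NODE_COUNTS)
        for k in range(1, NODE_COUNTS[node_class] + 1)
    ]
    print(len(scenarios))   # 36 scenarios already, before even mixing failure
                            # types or combining proxy and storage failures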
20. Evaluation - Setup
Focus on performance (not data availability)
Measure the performance degradation compared to the “no failure” case, before the failed nodes come back on-line
Workload: Backup workload (uploading large objects is the major operation)
Swift cluster: 2 proxy nodes (Cluster: Xeon CPU, 10 GE), 10 storage nodes
Two common failure scenarios: (1) entire storage node failure (2) HDD failure in storage node
(1) Entire storage node failure
10%, 20%, 30% and 40% storage nodes failed in a cluster (E.g. partial power outage)
Different HW resources are provisioned for storage node
EC2 Large for storage node (cost-efficient, high performance/cost)
EC2 High-CPU for storage node (costly, over-provisioned for CPU resources)
21. Evaluation - Setup
(2) HDD failure in storage node (EC2 Large for storage node)
Each storage node has 8 HDDs attached
We intentionally unmount some HDDs during execution.
The storage node itself is still accessible
10%, 20%, 30% and 40% of HDDs failed in a cluster
Compare two failure distributions:
Uniform HDD failure (failed HDDs uniformly distributed over all storage nodes)
Skewed HDD failure (some storage nodes have many more failed HDDs than other nodes)
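To make the two distributions concrete, here is a small sketch of how 20% of the HDDs could be chosen for unmounting in each case; the node names and /srv/node/* device paths are assumptions, not the actual test layout.

    nodes = {"storage-%d" % i: ["/srv/node/sd%s" % chr(ord("b") + d) for d in range(8)]
             for i in range(10)}                              # 10 storage nodes x 8 HDDs
    all_disks = [(node, disk) for node, disks in nodes.items() for disk in disks]
    to_fail = int(0.20 * len(all_disks))                      # 20% of all HDDs in the cluster

    # Uniform: walk the nodes round-robin so failures spread evenly across them
    uniform = [(node, disks[d]) for d in range(8) for node, disks in nodes.items()][:to_fail]
    # Skewed: concentrate the same number of failures on the first couple of nodes
    skewed = all_disks[:to_fail]

    # On each selected node one would then unmount the device (e.g. `umount /srv/node/sdb`)
    # so the disk disappears while the node itself stays reachable.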
22. Evaluation – Entire Node Failure
[Chart: Cluster throughput (operations / second) under no failure and 10% / 20% / 30% / 40% storage-node failures, for storage nodes based on Large vs. High-CPU Instances.]
Storage nodes based on the Large Instance: throughput decreases as more storage nodes fail
Storage nodes based on the High-CPU Instance: throughput decreases only when 40% of the nodes fail
23. Evaluation – Entire Node Failure
[Charts: CPU usage (%) and network bandwidth (MB/s) in an unaffected storage node, under no failure and 10% / 20% / 30% / 40% storage-node failures, for storage nodes based on Large vs. High-CPU Instances.]
When storage nodes are based on the High-CPU Instance
The over-provisioned resources in the unaffected nodes get used more as the number of failures increases
So, they can keep performance from degrading initially
When storage nodes are based on the Large Instance
CPU is almost saturated even when no failure happens
24. Evaluation – HDD Failure
[Charts: (left) Cluster throughput (operations / second) under no failure, uniform 10% / 20% / 30% / 40% HDD failures, skewed 40% HDD failure, and 40% node failure; (right) usage (%) of unaffected disks when 10% / 20% / 30% / 40% of HDDs fail uniformly.]
When HDDs fail uniformly across all storage nodes
Throughput does not decrease! Why? The I/O load is evenly redistributed over the unaffected HDDs (whose usage rises, as the right-hand chart shows)
When some storage nodes have more failed HDDs than others (skewed)
Throughput decreases significantly, but is still better than an entire-node failure
Extreme case: when all HDDs on a storage node fail, it is almost equivalent to an entire-node failure
25. Evaluation – Take-away points
In order to maintain a certain performance level in the face of failures
It makes sense to “over-provision” the HW resources to some extent
When failures happen, the “over-provisioned” resources reduce the performance degradation
Entire storage node failure vs. HDD failure
An entire-node failure is worse than an HDD failure.
When only HDDs fail, the performance degradation depends on the failure distribution:
If failed HDDs are uniformly distributed across all storage nodes
degradation is smaller, because I/O load can be rebalanced over unaffected HDDs
Otherwise (failure distribution is skewed)
degradation may be larger
What about proxy node failures, or proxy and storage nodes failing together?
They will also reduce performance; we need to investigate this in the future
26. When Failed Nodes Are Fixed
When all failed (affected) nodes have been fixed and re-join the Swift cluster
(1) How long will recovery take on the affected nodes?
(2) What is the performance while recovery is in progress?
We will show empirical results in our blog (http://www.zmanda.com/blogs/)
For (1), it depends on:
How much data needs to be recovered
Network latency between unaffected and affected nodes
HW resources (e.g. CPU) in the unaffected nodes (to look up which data needs to be restored)
27. When Failed Nodes Are Fixed
When all failed nodes have been fixed and re-join the Swift cluster
(1) How long will recovery take on the affected nodes?
(2) What is the performance while recovery is in progress?
For (2), it depends on:
HW resources in the unaffected nodes. The unaffected nodes become more heavily loaded because they still serve requests while also helping the affected nodes restore their data
Performance will gradually increase as the recovery progress approaches 100%
30. Evaluation – Download small objects
Top-3 recommended hardware for a small Swift cluster
                           HW for proxy node           HW for storage node
Download Small Objects  1  2 proxy nodes (High-CPU)    5 storage nodes (Large)
                        2  2 proxy nodes (Cluster)     5 storage nodes (Large)
                        3  2 proxy nodes (High-CPU)    10 storage nodes (Large)
All storage nodes are based on Large
Only one copy of the data is retrieved, so CPU and disk I/O are not busy
Large is sufficient for this workload and costs less than High-CPU
Proxy node
Traffic pattern: high request throughput, low network bandwidth (e.g. 2400 op/s at a ~50 KB average object size ≈ 117 MB/s)
10 GE from Cluster is over-provisioned for this traffic pattern
1 GE from High-CPU is adequate
5 storage nodes can almost saturate the 2 proxy nodes.
31. Evaluation – Download large objects
Top-3 recommended hardware for a small Swift cluster
                           HW for proxy node           HW for storage node
Download Large Objects  1  2 proxy nodes (Cluster)     5 storage nodes (Large)
                        2  2 proxy nodes (Cluster)     10 storage nodes (Large)
                        3  2 proxy nodes (High-CPU)    5 storage nodes (Large)
All storage nodes are based on Large
The request rate is low, so there is little load on the CPU.
The Large Instance is sufficient for this workload and costs less than High-CPU.
Proxy node
Traffic pattern: low request throughput, high network bandwidth (e.g. 70 op/s at a ~5 MB average object size ≈ 350 MB/s)
1 GE on High-CPU is under-provisioned; 10 GE on Cluster pays off for this workload.
5 storage nodes can nearly saturate the 2 proxy nodes.