Randy Bias, Co-Founder and CTO of Cloudscaling, speaks on open storage, fault tolerance and the concept of failure "blast radius" at the Open Storage Summit, hosted by Nexenta in May 2012.
1. Cloud Storage Futures (previously: Designing Private & Public Clouds)
May 22nd, 2012
Randy Bias, CTO & Co-founder
@randybias
CC Attribution-NoDerivs 3.0 Unported License - usage OK, no modifications, full attribution*
* All unlicensed or borrowed works retain their original licenses
17. Scale-out Principles
• Small failure domains
• Risk acceptance vs. risk mitigation
• More boxes for throughput & redundancy
• Assume the app manages complexity:
• Data replication
• Assume the infrastructure is unreliable:
• Server & data redundancy
• Geo-distribution
• Auto-scaling
18. What’s a failure domain?
• “Blast radius” during a failure
• What is impacted?
• Public SAN failures:
• FlexiScale SAN failure in 2007
• UOL Brazil in 2011:
• http://goo.gl/8ct9n
• There are many more
• Enterprise HA ‘pairs’ typically support BIG failure domains
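The “blast radius” idea can be made concrete with simple arithmetic: if capacity is split across N independent failure domains, losing one domain touches roughly 1/N of the data. A minimal sketch (the capacity figures are illustrative, not from the talk):

```python
def blast_radius(total_tb: float, failure_domains: int) -> float:
    """Fraction of all data impacted when one failure domain fails,
    assuming capacity is spread evenly across the domains."""
    return (total_tb / failure_domains) / total_tb

# One big HA SAN pair: a single failure domain holds everything.
print(blast_radius(500, 1))   # 1.0 -> 100% of the data affected

# The same capacity split into 50 small scale-out domains.
print(blast_radius(500, 50))  # 0.02 -> one failure touches 2%
```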
19. Two Diff Arches for Two Kinds of Apps

                            Elastic Scale-out Cloud   “Enterprise” Cloud
Location of complexity      App                       Infra
Primary scaling dimension   Out                       Up
21. Two Diff Storages for Two Kinds of Clouds
“Classic” “Scale-out”
Storage Storage
Uptime in Infra Uptime in apps
Every part is redundant Minimal h/w redundancy
Data mgmt in Infra Data mgmt in apps
Bigger SAN/NAS/DFS Smaller failure domains
15
22. Difference in Tiers

Tier   $      Purpose             Classic                          Scale-out
1      $$$$   Mission Critical    SAN, then NAS; 10-15K RPM; SSD   On-demand SAN (EBS); DynamoDB (AWS); variable service levels
2      $$     Important           NAS, then SAN; 7.2K RPM          DAS; app / DFS to scale out
3      $      Archive & Backups   Tape; nearline 5.4K              Object Storage
23. The Biggest Difference Is Where Data Management Resides
• In scale-out systems, apps are managing the data:
• Riak / scale-out distributed data store
• Hadoop+HDFS / scale-out distributed computation system
• Cassandra / scale-out distributed columnar database
24. Cassandra / Netflix use case
• 3 x Replication
• Linearly scaling performance
• 50 - 300 nodes
• > 1M writes/second
• When is this perfect?
• data size unknown
• growth unknown
• lots of elastic dynamism
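The theme of this slide, that the app rather than the SAN owns replication, can be sketched as a Cassandra-style write path: send each write to three replicas and acknowledge the client at quorum. Everything here (the node dicts, the function name, the quorum of 2) is an illustrative toy, not Netflix's or Cassandra's actual code:

```python
import hashlib

def replicate_write(key, value, nodes, rf=3, quorum=2):
    """App-managed replication: write to `rf` replicas chosen from the
    key, succeed once `quorum` of them acknowledge. Nodes are plain
    dicts standing in for real servers."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    replicas = [nodes[(start + i) % len(nodes)] for i in range(rf)]
    acks = 0
    for node in replicas:
        node[key] = value   # "send" the write to this replica
        acks += 1           # pretend the replica acked
    # A real client returns as soon as `quorum` acks arrive; the
    # remaining replicas catch up in the background.
    return acks >= quorum

cluster = [dict() for _ in range(5)]
ok = replicate_write("user:42", "profile-data", cluster)
print(ok, sum(1 for n in cluster if "user:42" in n))  # True 3
```

Because the app picks replicas from the key, per-node work stays constant as nodes are added, which is what gives the linear scaling and constant client write times cited above.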
25. Cassandra / Netflix use case
• DAS (‘ephemeral store’)
• Per node perf is constant
• disk
• CPU
• network
• Client write times constant
• Nothing special here
26. Cassandra / Netflix use case
• On-demand & app-managed
• Cost per GB/hr: $.006
• Cost per GB/mo: $4.14
• Includes: storage, DB, storage admin, network, network admin, etc.
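As a sanity check, the hourly and monthly figures above can be related directly. The roughly 690 billed hours per month implied below is inferred from the slide's two numbers, not stated in the talk:

```python
gb_hour_cost = 0.006   # $/GB-hour, from the slide
gb_month_cost = 4.14   # $/GB-month, from the slide

# Billed hours per month implied by the two quoted rates:
implied_hours = gb_month_cost / gb_hour_cost
print(round(implied_hours))  # 690, a bit under a full 720-744 hour month
```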
29. There are a few basic approaches being taken ...
30. Scale-out SAN
• In-rack SAN == faster, bigger DAS w/ better stat-muxing
• Accept normal DAS failure rates
• Assume app handles data replication
• Like AWS ‘ephemeral storage’
• KT architecture
• Customers didn’t “get it”
• “Ephemeral SAN” not well understood
[Design callout: dedicated storage SW, 9K jumbo frames, SSD caches (ZIL/L2ARC), no replication, max HW redundancy]
31. AWS EBS - “Block-devices-as-a-Service”
• Scale-out SAN (sort of)
• Block scheduler
• Async replication
• Some failure tolerance
• Scheduler:
• Allocates customer block devices across many failure domains
• Customers run RAID inside VMs to increase redundancy
[Diagram: API → Cloud Control System → EBS scheduler → EBS clusters over the core network, with inter-rack and intra-rack async replication between racks]
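The scheduler bullet is the interesting part: the point of EBS-style placement is that one customer's block devices should land in different failure domains. A minimal sketch of such a placement policy; the class and method names are illustrative, not AWS's implementation:

```python
class BlockScheduler:
    """Toy EBS-style scheduler: spread each customer's volumes
    across distinct failure domains (e.g. racks)."""

    def __init__(self, failure_domains):
        self.domains = list(failure_domains)
        self.placements = {}  # customer -> list of domains already used

    def allocate(self, customer):
        used = self.placements.setdefault(customer, [])
        # Prefer a domain this customer is not already using.
        for domain in self.domains:
            if domain not in used:
                used.append(domain)
                return domain
        # All domains in use: fall back to this customer's least-loaded one.
        domain = min(self.domains, key=used.count)
        used.append(domain)
        return domain

sched = BlockScheduler(["rack-1", "rack-2", "rack-3"])
print([sched.allocate("cust-a") for _ in range(4)])
# ['rack-1', 'rack-2', 'rack-3', 'rack-1']
```

The first three volumes land in distinct racks; only the fourth is forced to share, which is why customers still layer RAID inside the VM for extra redundancy.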
32. DAS + Big Data (Storage + Compute + DFS)
• Storage capability:
• Replication
• Disk & server failure
• Data rebalancing
• Data locality
• rack awareness
• Checksums (basic)
• Also:
• Built in computation
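The data-locality and rack-awareness bullets can be illustrated with the classic HDFS-style placement rule: first replica on the writer's node, second on a different rack, third on that second rack but a different node. A rough sketch; the node and rack names are invented for illustration:

```python
def place_replicas(writer, nodes_by_rack):
    """HDFS-style 3-replica placement.
    nodes_by_rack: dict mapping rack name -> list of node names."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    # Replica 1: the writer's own node (data locality for computation).
    replicas = [writer]
    # Replica 2: a node on a *different* rack (survives a rack failure).
    remote_rack = next(r for r in nodes_by_rack if r != local_rack)
    replicas.append(nodes_by_rack[remote_rack][0])
    # Replica 3: same remote rack, different node (cheap third copy).
    replicas.append(nodes_by_rack[remote_rack][1])
    return replicas

racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"]}
print(place_replicas("a1", racks))  # ['a1', 'b1', 'b2']
```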
33. Distributed File Systems (DFS) over DAS
• Storage capability:
• Replication
• Disk & server failure
• Data rebalancing
• Checksums (w/ btrfs)
• Block devices
• Also:
• No computation
[Diagram: ceph architecture✴]
✴ Looks familiar, doesn’t it?
34. Why is DFS at the Physical Layer Dangerous for Scale-out?
DFS == [slide image]
36. Object Storage
• Storage capability:
• Replication
• Disk & server failure
• Data rebalancing
• Checksums (sometimes)
• Also:
• Looks like a big web app
• Uses a DHT/CHT to ‘index’ blobs
• Very simple
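The DHT/CHT bullet is the core mechanism: an object store hashes each blob name onto a ring of storage nodes, so any front-end can locate a blob without a central index. A minimal consistent-hashing sketch, with assumed node names (a production system would also use virtual nodes for balance):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring for locating blobs on storage nodes."""

    def __init__(self, nodes):
        # Each node owns the ring segment ending at its hash point.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, blob_name):
        h = self._hash(blob_name)
        points = [p for p, _ in self.ring]
        i = bisect(points, h) % len(self.ring)  # wrap around the ring
        return self.ring[i][1]

ring = HashRing(["store-1", "store-2", "store-3"])
print(ring.node_for("photos/cat.jpg"))  # every front-end computes the same node
```

Because the lookup is pure hashing, the "index" costs nothing to replicate across front-ends, which is what makes the design "very simple" and web-app-like.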
37. Where does OpenStorage Fit?

Scale-out Solution   Purpose / Tier   Virtual or Physical?   Fit
Scale-out SAN        Tier-1/2         Physical               In-rack SAN
EBS                  Tier-1           Physical               EBS clusters (scale-out SAN)
DAS+BigData          Tier-2           Virtual                Reliable, bit-rot resistant DAS
DAS+DFS              Tier-2           Physical / Virtual     Reliable, bit-rot resistant DAS (unproven)
DAS+DB               Tier-2           Virtual                In-VM reliable DAS
Object Storage       Tier-3           Physical               Reliable, bit-rot resistant DAS
38. Summarizing ZFS Value in Scale-out
• Data integrity & bit rot are issues that few solve today
• Most SAN/NAS solutions don’t ‘scale down’
• Commodity x86 servers are winning
• There are two scale-out places where ZFS wins:
• Small SAN clusters
• Best DAS management
40. Conclusions / Speculations
• Build the right cloud
• Which means the right storage for *that* cloud
• A single cloud might support both ...
• Open storage can be used for both ...
• ... WITH the appropriate design/forethought