Flat Datacenter Storage
Presented by Alex Rasmussen
Papers We Love SF #11
2015-01-22
Paper by Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue
@alexras
Sort Really Fast
THEMIS
MapReduce Really Fast
Image Credit: http://bit.ly/17Vf8Hb
A Perfect World
“Magic RAID”
The Real World
Move the Computation to the Data!
Location Awareness Adds Complexity
Why “Move the Computation to the Data”?
Remote Data Access is Slow. Why?
The Network is Oversubscribed
Core / Aggregation / Edge
[Figure 1: Common data center interconnect topology. Host-to-switch links are GigE; links between switches are 10 GigE.]
Aggregate bandwidth above is less than aggregate demand below, sometimes by 100x or more.
What if I told you the network isn’t oversubscribed?
Consequences
• No local vs. remote disk distinction
• Simpler work schedulers
• Simpler programming models
FDS
Object Storage Assuming No Oversubscription
Motivation
Architecture and API
Metadata Management
Replication and Recovery
Data Transport
Why FDS Matters
Blob 0xbadf00d
Tract 0 | Tract 1 | Tract 2 | ... | Tract n (each tract is 8 MB)
CreateBlob
OpenBlob
CloseBlob
DeleteBlob
GetBlobSize
ExtendBlob
ReadTract
WriteTract
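The tract model above is simple enough to sketch end to end. Below is a toy, in-memory stand-in for the API named on this slide (method names adapted to Python style); real FDS tractservers are remote, replicated, and asynchronous, so this only illustrates the 8 MB tract addressing scheme, not the system itself.

```python
# Illustrative only: an in-memory sketch of the tract-based blob API from
# the slide. A blob is a sequence of fixed-size tracts addressed by index.

TRACT_SIZE = 8 * 1024 * 1024  # each tract is 8 MB

class InMemoryBlobStore:
    def __init__(self):
        self.blobs = {}  # GUID -> list of tract contents (bytes)

    def create_blob(self, guid):
        self.blobs[guid] = []

    def extend_blob(self, guid, num_tracts):
        # Reserve space for additional tracts at the end of the blob.
        self.blobs[guid].extend(b"" for _ in range(num_tracts))

    def get_blob_size(self, guid):
        return len(self.blobs[guid])  # size is measured in tracts

    def write_tract(self, guid, tract, data):
        assert len(data) <= TRACT_SIZE, "a write covers at most one tract"
        self.blobs[guid][tract] = data

    def read_tract(self, guid, tract):
        return self.blobs[guid][tract]

    def delete_blob(self, guid):
        del self.blobs[guid]

store = InMemoryBlobStore()
store.create_blob(0xBADF00D)
store.extend_blob(0xBADF00D, 2)
store.write_tract(0xBADF00D, 0, b"hello")
print(store.read_tract(0xBADF00D, 0))  # b'hello'
print(store.get_blob_size(0xBADF00D))  # 2
```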
API Guarantees
• Tractserver writes are atomic
• Calls are asynchronous
- Allows deep pipelining
• Weak consistency to clients
Motivation
Architecture and API
Metadata Management
Replication and Recovery
Data Transport
Why FDS Matters
Tract Locator Table

Tract Locator | Version | TS
1 | 0 | A
2 | 0 | B
3 | 2 | D
4 | 0 | A
5 | 3 | C
6 | 0 | F
... | ... | ...
Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
Randomize blob’s tractserver, even if GUIDs aren’t random (uses SHA-1)
Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
Large blobs use all TLT entries uniformly
Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
Blob Metadata is Distributed
Tract_Locator = TLT[(Hash(GUID) - 1) % len(TLT)]
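The two locator formulas above can be sketched directly. The TLT contents ("A", "B", ...) below are invented for the example; a real FDS TLT row also carries a version number and a replica group rather than a single server.

```python
import hashlib

# Illustrative sketch of FDS tract location: hash the blob's GUID with
# SHA-1, add the tract index, and index into the Tract Locator Table.

def tract_locator(tlt, guid, tract):
    # SHA-1 randomizes placement even when GUIDs aren't random; adding the
    # tract index walks the table, so large blobs use all entries uniformly.
    h = int.from_bytes(hashlib.sha1(guid).digest(), "big")
    return tlt[(h + tract) % len(tlt)]

def metadata_locator(tlt, guid):
    # Blob metadata lives at "tract -1", so it is spread across
    # tractservers the same way the data is.
    return tract_locator(tlt, guid, -1)

tlt = ["A", "B", "D", "A", "C", "F"]
guid = "0xbadf00d".encode()
print([tract_locator(tlt, guid, t) for t in range(3)], metadata_locator(tlt, guid))
```

Note that consecutive tracts land on consecutive TLT entries, which is exactly why a large blob touches every entry about equally often.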
TLT Construction
• m Permutations of Tractserver List
• Weighted by disk speed
• Served by metadata server to clients
• Only update when cluster changes
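The bullets above can be sketched as follows. This is a hedged guess at the construction, not the paper's exact algorithm: m random permutations of the tractserver list are concatenated, with faster disks weighted to occupy more slots. The server names and weights are made up for the example.

```python
import random

# Sketch of TLT construction: concatenate m weighted random permutations
# of the tractserver list. Faster disks get more table entries, so they
# receive proportionally more tracts.

def build_tlt(servers, m, seed=0):
    rng = random.Random(seed)
    # A server with weight w occupies w slots in each permutation.
    weighted = [name for name, weight in servers for _ in range(weight)]
    tlt = []
    for _ in range(m):
        perm = weighted[:]
        rng.shuffle(perm)
        tlt.extend(perm)
    return tlt

servers = [("A", 2), ("B", 1), ("C", 1)]  # "A" has faster disks
tlt = build_tlt(servers, m=3)
print(len(tlt), tlt.count("A"))  # 12 6
```

Because the table only changes when cluster membership does, the metadata server can hand the whole TLT to clients once and stay off the data path.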
Cluster Growth

Tract Locator | Version | TS
1 | 0 | A
2 | 0 | B
3 | 2 | D
4 | 0 | A
5 | 3 | C
6 | 0 | F
... | ... | ...
Cluster Growth

Tract Locator | Version | TS
1 | 1 | NEW / A
2 | 0 | B
3 | 2 | D
4 | 1 | NEW / A
5 | 4 | NEW / C
6 | 0 | F
... | ... | ...
Cluster Growth

Tract Locator | Version | TS
1 | 2 | NEW
2 | 0 | A
3 | 2 | A
4 | 2 | NEW
5 | 5 | NEW
6 | 0 | A
... | ... | ...
Motivation
Architecture and API
Metadata Management
Replication and Recovery
Data Transport
Why FDS Matters
Replication

Tract Locator | Version | Replica 1 | Replica 2 | Replica 3
1 | 0 | A | B | C
2 | 0 | A | C | Z
3 | 0 | A | D | H
4 | 0 | A | E | M
5 | 0 | A | F | G
6 | 0 | A | G | P
... | ... | ... | ... | ...
Replication
• Create, Delete, Extend:
- client writes to primary
- primary 2PC to replicas
• Write to all replicas
• Read from random replica
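The data path in these bullets can be sketched with tractservers stubbed as dicts. This only illustrates the write-to-all / read-from-any rule; metadata operations (create, delete, extend) instead go through the primary, which runs two-phase commit across the replicas.

```python
import random

# Minimal sketch of FDS's replication rule for plain tract I/O.

def write_tract(replicas, tract, data):
    for server in replicas:           # writes go to ALL replicas
        server[tract] = data

def read_tract(replicas, tract):
    server = random.choice(replicas)  # reads hit ONE random replica
    return server[tract]

replicas = [{}, {}, {}]               # stand-ins for tractservers A, B, C
write_tract(replicas, 0, b"payload")
print(read_tract(replicas, 0))  # b'payload'
```

Reading from a random replica spreads read load across the cluster, which fits the TLT's goal of statistically uniform placement.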
Recovery

Tract Locator | Version | Replica 1 | Replica 2 | Replica 3
1 | 0 | A | B | C
2 | 0 | A | C | Z
3 | 0 | A | D | H
4 | 0 | A | E | M
5 | 0 | A | F | G
6 | 0 | A | G | P
... | ... | ... | ... | ...

Recover 1TB from 3000 disks in < 20 seconds
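The deck's recovery figure (re-replicating 1 TB across roughly 3,000 disks in under 20 seconds) follows from the TLT scattering a failed server's replica partners across the whole cluster, so every disk shares the recovery work. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the recovery claim: with replicas scattered
# by the TLT, losing a server turns recovery into an all-to-all copy that
# every surviving disk shares.

data_mb = 1_000_000          # ~1 TB to re-replicate
disks = 3000
seconds = 20

per_disk_mb = data_mb / disks            # each disk moves ~333 MB
per_disk_rate = per_disk_mb / seconds    # ~16.7 MB/s, easy for one spindle
print(round(per_disk_mb), round(per_disk_rate, 1))  # 333 16.7
```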
MIND BLOWN
Motivation
Architecture and API
Metadata Management
Replication and Recovery
Data Transport
Why FDS Matters
Networking
[Figure 3: Simple fat-tree topology, with edge, aggregation, and core switch layers.]
CLOS topology: small switches + ECMP = full bisection bandwidth
Networking
• Network bandwidth = disk bandwidth
• Full bisection bandwidth is stochastic
• Short flows good for ECMP
• TCP hates short flows
• RTS/CTS to mitigate incast; see paper
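The deck mitigates TCP incast with an application-level request-to-send / clear-to-send scheme (details in the paper). The toy sketch below shows only the core idea, with invented message and queue shapes: a receiver queues RTS messages and grants CTS to a bounded number of senders at once, so many simultaneous responses cannot overflow its switch port.

```python
from collections import deque

# Toy sketch of application-level RTS/CTS flow control against TCP incast.

class Receiver:
    def __init__(self, max_concurrent=1):
        self.pending = deque()          # senders waiting for clear-to-send
        self.granted = set()            # senders currently cleared to transmit
        self.max_concurrent = max_concurrent

    def rts(self, sender):
        # Sender announces it has data; it must wait for a grant.
        self.pending.append(sender)
        self._grant()

    def done(self, sender):
        # Sender finished transmitting; free its slot for the next in line.
        self.granted.discard(sender)
        self._grant()

    def _grant(self):
        while self.pending and len(self.granted) < self.max_concurrent:
            self.granted.add(self.pending.popleft())

rx = Receiver(max_concurrent=2)
for s in ["A", "B", "C"]:
    rx.rts(s)
print(sorted(rx.granted))  # ['A', 'B'] -- "C" waits for a free slot
rx.done("A")
print(sorted(rx.granted))  # ['B', 'C']
```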
Motivation
Architecture and API
Metadata Management
Replication and Recovery
Data Transport
Why FDS Matters
Co-Design
Hardware/Software Combination Designed for a Specific Workload
FDS Works Great for Blob Storage on CLOS Networks
www.sortbenchmark.org
Indy
Daytona
MinuteSort - Daytona

System (Nodes) | Year | Data Sorted | Speed per Disk
Hadoop (1408) | 2009 | 500GB | 3 MB/s
FDS (256) | 2012 | 1470GB | 46 MB/s
MinuteSort - Indy

System (Nodes) | Year | Data Sorted | Speed per Disk
TritonSort (66) | 2012 | 1353GB | 43.3 MB/s
FDS (256) | 2012 | 1470GB | 47.9 MB/s
FDS isn’t built for oversubscribed networks.
It’s also not a DBMS.
MapReduce and GFS: Thousands of Cheap PCs, Bulk Synchronous Processing
10x
MapReduce and GFS Aren’t Designed for …, or Iterative, or OLAP
FDS’ Lessons
• Great example of ground-up rethink
- Ambitious but implementable
• Big wins possible with co-design
• Constantly re-examine assumptions
TritonSort & Themis
• Balanced hardware architecture
• Full bisection-bandwidth network
• Job-level fault tolerance
• Huge wins possible
- Beat 3000+ node cluster by 35% with 52 nodes
• NSDI 2012, SoCC 2013
Thanks
@alexras • alexras@acm.org • alexras.info
Flat Datacenter Storage (FDS) is, as the intro describes, "a high-performance, fault-tolerant, large-scale, locality-oblivious blob store". It's also a great example of how carefully thought-out co-design of software and hardware for a target workload can yield really impressive performance results, even in the presence of heterogeneity and when operating at scale. In my (admittedly biased) opinion, this style of system design doesn't get enough attention outside of academia, and it has a lot to teach us about how data-intensive systems should be designed.