
Papers We Love January 2015 - Flat Datacenter Storage


  1. Flat Datacenter Storage (Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue)
 Presented by Alex Rasmussen, Papers We Love SF #11, 2015-01-22
  2. @alexras
  3. Sort Really Fast THEMIS MapReduce Really Fast
  4. Image Credit: http://bit.ly/17Vf8Hb
  5. A Perfect World
  6. “Magic RAID”
  7. The Real World
  8. Move the Computation to the Data!
  9. Location Awareness Adds Complexity
  10. Why “Move the Computation to the Data”?
  11. Remote Data Access is Slow. Why?
  12. The Network is Oversubscribed
  13. [Figure: common data center interconnect topology with edge, aggregation, and core layers; host-to-switch links are GigE, links between switches are 10 GigE; cost table comparing fat-tree vs. hierarchical designs at 1:1, 3:1, and 7:1 oversubscription]
 Aggregate bandwidth above is less than aggregate demand below, sometimes by 100x or more
  14. What if I told you the network isn’t oversubscribed?
  15. Consequences • No local vs. remote disk distinction • Simpler work schedulers • Simpler programming models
  16. FDS Object Storage Assuming No Oversubscription
  17. Motivation Architecture and API Metadata Management Replication and Recovery Data Transport Why FDS Matters
  18. Blob 0xbadf00d Tract 0 Tract 1 Tract 2 Tract n... 8 MB CreateBlob OpenBlob CloseBlob DeleteBlob GetBlobSize ExtendBlob ReadTract WriteTract
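
The blob and tract calls on slide 18 map onto a small client surface. Below is a minimal in-memory Python stand-in, assuming these method names and signatures (OpenBlob/CloseBlob omitted); it is a sketch of the call shape, not the actual FDS client library.

    TRACT_SIZE = 8 * 1024 * 1024  # tracts are 8 MB, per slide 18

    class FdsClient:
        """In-memory stand-in for the FDS blob/tract call surface (not the real client)."""

        def __init__(self):
            self._blobs = {}  # GUID -> list of fixed-size tracts

        def create_blob(self, guid):
            self._blobs[guid] = []

        def delete_blob(self, guid):
            del self._blobs[guid]

        def get_blob_size(self, guid):
            return len(self._blobs[guid])  # blob size measured in tracts

        def extend_blob(self, guid, num_tracts):
            self._blobs[guid].extend(bytearray(TRACT_SIZE) for _ in range(num_tracts))

        def write_tract(self, guid, tract, data):
            assert len(data) == TRACT_SIZE  # tracts are read and written whole
            self._blobs[guid][tract] = data

        def read_tract(self, guid, tract):
            return self._blobs[guid][tract]
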
  19. API Guarantees • Tractserver writes are atomic • Calls are asynchronous - Allows deep pipelining • Weak consistency to clients
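
A minimal sketch of the "deep pipelining" point on slide 19: because calls are asynchronous, a client can keep many ReadTract requests in flight at once instead of paying each request's latency serially. The read_tract coroutine and the 64-request window below are illustrative assumptions, not FDS internals.

    import asyncio

    async def read_tract(guid, tract):
        # Stand-in for an asynchronous ReadTract call; real FDS issues an RPC to a tractserver.
        await asyncio.sleep(0.01)   # pretend per-request network + disk latency
        return tract

    async def read_blob_pipelined(guid, num_tracts, window=64):
        # Keep up to `window` requests outstanding so per-tract latency overlaps.
        results = [None] * num_tracts
        in_flight = asyncio.Semaphore(window)

        async def fetch(i):
            async with in_flight:
                results[i] = await read_tract(guid, i)

        await asyncio.gather(*(fetch(i) for i in range(num_tracts)))
        return results

    # asyncio.run(read_blob_pipelined("0xbadf00d", 256))
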
  20. Motivation Architecture and API Metadata Management Replication and Recovery Data Transport Why FDS Matters
  21. Tract Locator Table
      Tract Locator   Version   TS
      1               0         A
      2               0         B
      3               2         D
      4               0         A
      5               3         C
      6               0         F
      ...             ...       ...
  22. Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
  23. Randomize blob's tractserver, even if GUIDs aren't random (uses SHA-1): Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
  24. Large blobs use all TLT entries uniformly: Tract_Locator = TLT[(Hash(GUID) + Tract) % len(TLT)]
  25. Blob Metadata is Distributed: Tract_Locator = TLT[(Hash(GUID) - 1) % len(TLT)]
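
A small Python sketch of the locator formula from slides 22-25: SHA-1 of the GUID randomizes the starting row, adding the tract index spreads a large blob uniformly over the table, and "tract -1" holds the blob's metadata. The GUID encoding and byte order below are assumptions for illustration.

    import hashlib

    def tract_locator_index(guid, tract, tlt_len):
        # SHA-1 randomizes the starting row even for non-random GUIDs (slide 23);
        # adding the tract index walks the whole TLT for large blobs (slide 24).
        h = int.from_bytes(hashlib.sha1(guid.encode("utf-8")).digest(), "big")
        return (h + tract) % tlt_len

    def metadata_locator_index(guid, tlt_len):
        # Slide 25: blob metadata lives at "tract -1", so it is spread across
        # tractservers just like ordinary tracts.
        return tract_locator_index(guid, -1, tlt_len)

    tlt = ["A", "B", "D", "A", "C", "F"]                    # TS column from slide 21
    server = tlt[tract_locator_index("0xbadf00d", 7, len(tlt))]
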
  26. TLT Construction • m Permutations of Tractserver List • Weighted by disk speed • Served by metadata server to clients • Only update when cluster changes
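
A hedged sketch of TLT construction as described on slide 26: concatenate m shuffled copies of the tractserver list, with faster disks appearing proportionally more often. The repetition-based weighting and the per-entry version field are assumptions, not the paper's exact scheme.

    import random

    def build_tlt(disk_speeds, m=3, seed=42):
        # disk_speeds: tractserver -> relative disk speed. Faster servers are
        # repeated proportionally more often (one reading of "weighted by disk speed").
        slowest = min(disk_speeds.values())
        weighted = [ts for ts, speed in disk_speeds.items()
                    for _ in range(max(1, round(speed / slowest)))]
        rng = random.Random(seed)
        tlt = []
        for _ in range(m):                      # m independent permutations
            perm = list(weighted)
            rng.shuffle(perm)
            tlt.extend({"version": 0, "ts": ts} for ts in perm)
        return tlt

    tlt = build_tlt({"A": 100, "B": 100, "C": 150})
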
  27. Cluster Growth
      Tract Locator   Version   TS
      1               0         A
      2               0         B
      3               2         D
      4               0         A
      5               3         C
      6               0         F
      ...             ...       ...
  28. Cluster Growth
      Tract Locator   Version   TS
      1               1         NEW / A
      2               0         B
      3               2         D
      4               1         NEW / A
      5               4         NEW / C
      6               0         F
      ...             ...       ...
  29. Cluster Growth
      Tract Locator   Version   TS
      1               2         NEW
      2               0         A
      3               2         A
      4               2         NEW
      5               5         NEW
      6               0         A
      ...             ...       ...
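
Slides 27-29 show how growth is handled: reassigned rows get a new tractserver and a bumped version number ("NEW / A" indicates the new server being filled from the old owner). A minimal sketch of just the reassignment and version bump, with the data transfer itself omitted:

    def add_tractserver(tlt, new_server, rows):
        # Reassign the chosen rows to the new server and bump their versions.
        # Clients presenting a stale version get rejected by the tractserver
        # and re-fetch the TLT from the metadata server.
        for i in rows:
            tlt[i]["ts"] = new_server
            tlt[i]["version"] += 1

    # Rows 1, 4 and 5 of the slide-27 table (0-indexed 0, 3, 4) move to the new
    # server, matching the version bumps 0->1, 0->1 and 3->4 shown on slide 28.
    tlt = [{"version": v, "ts": ts}
           for v, ts in [(0, "A"), (0, "B"), (2, "D"), (0, "A"), (3, "C"), (0, "F")]]
    add_tractserver(tlt, "NEW", [0, 3, 4])
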
  30. Motivation Architecture and API Metadata Management Replication and Recovery Data Transport Why FDS Matters
  31. Replication
      Tract Locator   Version   Replica 1   Replica 2   Replica 3
      1               0         A           B           C
      2               0         A           C           Z
      3               0         A           D           H
      4               0         A           E           M
      5               0         A           F           G
      6               0         A           G           P
      ...             ...       ...         ...         ...
  32. Replication (same table as slide 31)
  33. Replication • Create, Delete, Extend: - client writes to primary - primary 2PC to replicas • Write to all replicas • Read from random replica
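
A hedged sketch of the tract read/write paths from slide 33: writes fan out from the client to every replica, reads go to one replica chosen at random. The send and fetch callables stand in for tractserver RPCs; the primary-led two-phase commit used for create/delete/extend is not shown.

    import random

    def write_tract_replicated(tlt_row, guid, tract, data, send):
        # Slide 33: tract writes go from the client directly to every replica.
        for replica in tlt_row["replicas"]:
            send(replica, guid, tract, data)

    def read_tract_replicated(tlt_row, guid, tract, fetch):
        # Reads go to one replica chosen at random, spreading read load.
        return fetch(random.choice(tlt_row["replicas"]), guid, tract)

    # Example with dummy RPC stand-ins:
    row = {"version": 0, "replicas": ["A", "B", "C"]}
    write_tract_replicated(row, "0xbadf00d", 7, b"...", send=lambda *a: None)
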
  34. Recovery
      Tract Locator   Version   Replica 1   Replica 2   Replica 3
      1               0         A           B           C
      2               0         A           C           Z
      3               0         A           D           H
      4               0         A           E           M
      5               0         A           F           G
      6               0         A           G           P
      ...             ...       ...         ...         ...
  35. Recovery (same table as slide 34): recover 1 TB from 3,000 disks in < 20 seconds ("HEAL ME")
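
Why sub-20-second recovery is plausible: because replicas are scattered across the whole TLT, every disk holds a small slice of the failed disk's data, so recovery is an all-to-all copy rather than a single-disk rebuild. A back-of-envelope check with an assumed 100 MB/s per disk (the per-disk speed is not from the slides):

    lost_mb       = 1_000_000   # ~1 TB on the failed disk
    disks         = 3_000       # disks sharing the recovery work
    disk_mb_per_s = 100         # assumed sequential throughput per disk

    per_disk_mb  = lost_mb / disks               # ~333 MB read or written per disk
    disk_seconds = per_disk_mb / disk_mb_per_s   # ~3-4 s of raw disk time;
                                                 # the rest of the <20 s budget is coordination and network
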
  36. MIND BLOWN
  37. Motivation Architecture and API Metadata Management Replication and Recovery Data Transport Why FDS Matters
  38. Networking [Figure: simple fat-tree topology with edge, aggregation, and core layers and two-level routing tables]
 CLOS topology: small switches + ECMP = full bisection bandwidth
  39. Networking • Network bandwidth = disk bandwidth • Full bisection bandwidth is stochastic • Short flows good for ECMP • TCP hates short flows • RTS/CTS to mitigate incast; see paper
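
A rough version of the "network bandwidth = disk bandwidth" provisioning rule from slide 39, using an assumed per-disk throughput (the slides don't give a number):

    disk_mb_per_s   = 128                          # assumed sequential throughput per disk
    disk_gbit_per_s = disk_mb_per_s * 8 / 1000     # ~1 Gbit/s of traffic per busy disk

    disks_per_node  = 10                                   # hypothetical storage node
    nic_gbit_needed = disks_per_node * disk_gbit_per_s     # ~10 Gbit/s of NIC so the
                                                           # network never throttles the disks
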
  40. Motivation Architecture and API Metadata Management Replication and Recovery Data Transport Why FDS Matters
  41. Co-Design
  42. Hardware/Software Combination Designed for a Specific Workload
  43. FDS Works Great for Blob Storage on CLOS Networks
  44. www.sortbenchmark.org
  45. Indy Daytona
  46. MinuteSort - Daytona
      System (Nodes)   Year   Data Sorted   Speed per Disk
      Hadoop (1408)    2009   500 GB        3 MB/s
      FDS (256)        2012   1,470 GB      46 MB/s
  47. MinuteSort - Indy
      System (Nodes)    Year   Data Sorted   Speed per Disk
      TritonSort (66)   2012   1,353 GB      43.3 MB/s
      FDS (256)         2012   1,470 GB      47.9 MB/s
  48. FDS isn’t built for oversubscribed networks.
 It’s also not a DBMS.
  49. MapReduce and GFS: Thousands of Cheap PCs, Bulk Synchronous Processing
  50. MapReduce and GFS Aren't Designed for Iterative or OLAP Workloads
  51. FDS’ Lessons • Great example of ground-up rethink - Ambitious but implementable • Big wins possible with co-design • Constantly re-examine assumptions
  52. TritonSort & Themis • Balanced hardware architecture • Full bisection-bandwidth network • Job-level fault tolerance • Huge wins possible - Beat 3000+ node cluster by 35% with 52 nodes • NSDI 2012, SoCC 2013
  53. Thanks @alexras • alexras@acm.org • alexras.info