Storage Systems

Lecture on Storage Systems
http://www.rust-class.org

Engineering tradeoffs in cost, latency, and robustness in storing data

Published in: Technology, Education

Storage Systems

  1. 12 November 2013, University of Virginia, cs4414
  2. Why is storage complicated?
  3. Delay Lines
  4. Mercury Delay Lines (0/1)
  7. Speed of Sound
       Air       343 m/s
       Mercury   1450 m/s (40° C)
       Water     1500 m/s (25° C)
     Why Mercury?
  8. Magnetic Core Memory: MIT Project Whirlwind, 1951. 2K 16-bit words with “no waiting”!
  9. SRAM [diagram: two NOT gates]
  10. Four-Transistor SRAM Bit
  11. Modern DRAM
  12. After Turning off Power: 5 seconds, 30 seconds, 5 minutes
  13. cycles (at 800MHz) to read a particular row = 13.75ns = 185° F
  14. Storage Systems
       Device                     Example                              Time to Access        Cost per Bit
       Mercury (Gin) Delay Line   UNIVAC (1951)                        220,000ns (average)   $0.38 (1968) (a bazillion n$)
       DRAM                       Kingston KVR16N11/4 4GB DDR3 ($40)   13.75ns               1.16 n$

       UNIVAC 1968 (Core memory): $823,500 for 131 K 16-bit words
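The cost-per-bit figures in the table above can be reproduced with a little arithmetic. This is a minimal sketch, assuming "4GB" means 4 × 2^30 bytes and "131 K" means 131 × 1024 words; those unit conventions and the helper name `cost_per_bit` are assumptions, not something the slides state.

```python
# Worked arithmetic for the "Cost per Bit" column (unit conventions are assumptions).

def cost_per_bit(price_dollars, capacity_bits):
    """Price divided by capacity, in dollars per bit."""
    return price_dollars / capacity_bits

# DRAM: $40 for 4 GB, taking 1 GB = 2**30 bytes and 8 bits per byte.
dram_bits = 4 * 2**30 * 8
dram_nanodollars = cost_per_bit(40, dram_bits) * 1e9

# Core memory (UNIVAC, 1968): $823,500 for 131 K 16-bit words, taking 1 K = 1024.
core_bits = 131 * 1024 * 16
core_dollars = cost_per_bit(823_500, core_bits)

print(f"DRAM: {dram_nanodollars:.2f} n$ per bit")
print(f"Core memory (1968): ${core_dollars:.2f} per bit")
```

Under these assumptions the results round to the 1.16 n$ and $0.38 shown on the slide.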
  15. Cheaper, More Persistent Storage
  16. How big is a TB?
  17. Storage Systems
       Device                     Example                                        Time to Access        Cost per Bit
       Mercury (Gin) Delay Line   UNIVAC (1951)                                  220,000ns (average)   $0.38 (1968) (a bazillion n$)
       DRAM                       Kingston KVR16N11/4 4GB DDR3 ($40)             13.75ns               1.16 n$
       Hard Drive                 Seagate Desktop HDD 4 TB SATA 6Gb/s NCQ 64MB   ?                     0.0046 n$
  18. Accessing a Hard Drive
       “seek time” ~ 0.1ms
       rotate time: 1/5900rpm ~ max 10ms (5900 rpm spindle)
  19. Passing the Drop Test
  20. Passing the Drop Test
  21. Storage Systems
       Device                     Example                                        Time to Access        Cost per Bit
       Mercury (Gin) Delay Line   UNIVAC (1951)                                  220,000ns (average)   $0.38 (1968) (a bazillion n$)
       DRAM                       Kingston KVR16N11/4 4GB DDR3 ($40)             13.75ns               1.16 n$
       Hard Drive                 Seagate Desktop HDD 4 TB SATA 6Gb/s NCQ 64MB   5ms (ave)             0.0046 n$
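The "max 10ms" rotate time and the "5ms (ave)" access time both follow from the 5900 rpm spindle speed. A small sketch of that arithmetic, assuming the requested sector is equally likely to be anywhere on the track (so the average wait is half a rotation); the function name is illustrative only.

```python
# Rotational latency for a spinning disk (helper name is hypothetical).

def rotation_time_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000.0 / rpm

full_rotation_ms = rotation_time_ms(5900)   # worst case: wait a whole turn
average_wait_ms = full_rotation_ms / 2      # assumed uniform sector position

print(f"full rotation: {full_rotation_ms:.2f} ms")
print(f"average rotational wait: {average_wait_ms:.2f} ms")
```

This ignores seek time and transfer time, which the slide treats separately.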
  22. Storage Abstractions
  25. Storage Abstractions: Memory Location, File. Do we really need both? What about: database, URI?
  26. Unix File Abstraction
  27. Which are files?
       class24.pptx
       /Users/dave/OS/classes/
       OS-provided random numbers
  28. “Everything is a File”
       class24.pptx
       /mnt/cdrom
       /Users/dave/OS/classes/
       OS-provided random numbers
       /dev/tty0
       /dev/random
  29. inode represents a file
       Size of File (bytes)
       Device ID
       User ID
       Group ID
       File Mode (permission bits)
       Link count (number of hard links to node)
       …
       Diskmap
  30. include/linux/fs.h
  31. stat shows the inode fields (Size of File (bytes), Device ID, User ID, Group ID, File Mode (permission bits), Link count (number of hard links to node), …, Diskmap):
       > stat -x class24.pptx
         File: "class24.pptx"
         Size: 5855495  FileType: Regular File
         Mode: (0644/-rw-r--r--)  Uid: ( 501/ dave)  Gid: ( 20/ staff)
       Device: 1,2  Inode: 6706357  Links: 1
       Access: Wed Nov 20 15:00:41 2013
       Modify: Wed Nov 20 14:23:13 2013
       Change: Wed Nov 20 14:23:13 2013
  32. > ln class24.pptx todays-class.pptx
       > stat -x class24.pptx
         File: "class24.pptx"
         Size: 5855495  FileType: Regular File
         Mode: (0644/-rw-r--r--)  Uid: ( 501/ dave)  Gid: ( 20/ staff)
       Device: 1,2  Inode: 6706357  Links: 2
       Access: ..
       > stat -x todays-class.pptx
         File: "todays-class.pptx"
         Size: 5855495  FileType: Regular File
         Mode: (0644/-rw-r--r--)  Uid: ( 501/ dave)  Gid: ( 20/ staff)
       Device: 1,2  Inode: 6706357  Links: 2
       > rm class24.pptx
       > stat -x class24.pptx
       stat: class24.pptx: stat: No such file or directory
       > stat -x todays-class.pptx
         File: "todays-class.pptx"
         Size: 5855495  FileType: Regular File
         Mode: (0644/-rw-r--r--)  Uid: ( 501/ dave)  Gid: ( 20/ staff)
       Device: 1,2  Inode: 6706357  Links: 1
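The same hard-link behaviour can be observed from a program. The sketch below repeats the `ln` / `stat` / `rm` session in a temporary directory using Python's standard `os.link` and `os.stat`; the function name `link_demo` and the file contents are made up for illustration, and it assumes a file system that supports hard links.

```python
# Hard links: two names, one inode, and a link count that tracks the names.
import os
import tempfile

def link_demo():
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "class24.pptx")
        alias = os.path.join(d, "todays-class.pptx")
        with open(original, "wb") as f:
            f.write(b"slides")

        os.link(original, alias)                 # like: ln class24.pptx todays-class.pptx
        same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
        links_before = os.stat(alias).st_nlink   # both names count as links

        os.remove(original)                      # like: rm class24.pptx
        links_after = os.stat(alias).st_nlink    # data survives under the other name
        with open(alias, "rb") as f:
            survived = f.read() == b"slides"
        return same_inode, links_before, links_after, survived

print(link_demo())
```

Removing one name only decrements the link count; the inode and its data blocks are released when the count reaches zero.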
  33. Removing a linked file like this is very confusing for PowerPoint…
  34. Diskmap (Unix System 5)
       inode: Size of File (bytes), Device ID, User ID, Group ID, File Mode (permission bits), Link count (number of hard links to node), …, Diskmap
       [diagram: diskmap entries 0, 1, 2, …, 9, 10, 11, 12; entries point to Disk Blocks (1K bytes)]
  35. Diskmap (Unix System 5)
       [diagram: diskmap entries 0 to 12; an Indirect Disk Block (1K bytes) holds pointers to Disk Blocks (1K bytes); 4 bytes for each = 256 pointers]
  36. Diskmap (Unix System 5)
       [diagram: as above, plus a Double Indirect Disk Block pointing to Indirect Disk Blocks (1K bytes)]
  37. Diskmap (Unix System 5)
       [diagram: as above]
       How would you determine if your file system has this structure?
  38. Diskmap (Unix System 5)
       [diagram: as above, with a further Disk Block (1K bytes)]
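To make the diskmap concrete, here is a rough sketch that maps a byte offset to the kind of diskmap entry that would reach it, and computes the largest file such a map can address. It takes the 1K blocks and 4-byte pointers from the slides and assumes entries 0 to 9 are direct, entry 10 is single indirect, entry 11 is double indirect, and entry 12 is triple indirect; that assignment, and all names in the code, are assumptions for illustration.

```python
# Diskmap arithmetic for a System V style inode (entry layout is an assumption).
BLOCK_SIZE = 1024                      # bytes per disk block (from the slides)
POINTER_SIZE = 4                       # bytes per block pointer (from the slides)
PTRS = BLOCK_SIZE // POINTER_SIZE      # 256 pointers per indirect block
DIRECT = 10                            # assumed: diskmap entries 0..9 are direct

def level_for_offset(offset):
    """Return which kind of diskmap entry covers the given byte offset."""
    block = offset // BLOCK_SIZE
    if block < DIRECT:
        return "direct"
    block -= DIRECT
    if block < PTRS:
        return "indirect"
    block -= PTRS
    if block < PTRS ** 2:
        return "double indirect"
    block -= PTRS ** 2
    if block < PTRS ** 3:
        return "triple indirect"
    raise ValueError("offset is beyond the largest addressable file")

def max_file_size():
    """Largest file, in bytes, reachable through all thirteen entries."""
    return (DIRECT + PTRS + PTRS ** 2 + PTRS ** 3) * BLOCK_SIZE

print(level_for_offset(5_000))        # within the first 10 blocks
print(level_for_offset(200_000))      # past 10 KB, within the single indirect block
print(max_file_size())
```

Small files need no extra block reads beyond the inode, while large files pay one, two, or three additional lookups per access.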
  39. Directories are Files Too! (ls -ali)
       Filename     Inode
       .            494211
       ..           494205
       .DS_Store    494212
       class0       6565946
       class1       6565826
       class10      1467012
       class11      2252968
       …            …
       class19      5649155
       class2       494218
       …            …
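A directory is essentially a table from names to inode numbers, which is what `ls -ali` prints. A minimal sketch of reading that table with Python's standard `os.listdir` and `os.lstat`; the function name `list_inodes` and the example file names are illustrative assumptions.

```python
# Read a directory as a name -> inode number mapping (similar to ls -ali).
import os
import tempfile

def list_inodes(path):
    """Return {filename: inode number} for a directory, including . and .."""
    entries = {
        ".": os.lstat(path).st_ino,
        "..": os.lstat(os.path.join(path, "..")).st_ino,
    }
    for name in sorted(os.listdir(path)):
        entries[name] = os.lstat(os.path.join(path, name)).st_ino
    return entries

# Usage: build a small directory and print its table.
with tempfile.TemporaryDirectory() as d:
    for name in ("class0", "class1"):
        os.mkdir(os.path.join(d, name))
    table = list_inodes(d)
    for name, inode in table.items():
        print(f"{inode:>10}  {name}")
```

The inode numbers will differ on every machine; only the structure of the table matters.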
  40. > brew install tree # needed on MacOS X, but builtin to most Unixes
  41. How to create a new file?
  42. Finding a Free Block
       Disk layout (not to scale!):
         Boot block
         Superblock: list of free disk blocks
         I-List (inodes): 0, 1, …, 98, 99
         Data: 0, 1, …, 98, 99
  43. Finding a Free inode
       Disk layout (not to scale!):
         Boot block
         Superblock: keeps a cache of free inodes
         I-List (inodes): 0, 1, 2, 3, … (entries shown: 0 1 0 0 …)
         Data
  44. Modern File Systems
  45. What should a modern file system do that Unix S5FS doesn’t?
  46. Handling Failures: ZFS
       Developed for Solaris, 2005. Now open source: http://open-zfs.org/
       “MacZFS is free data storage and protection software for all Mac OS users. It's for people who have Mac OS, who have any data, and who really like their data. Whether on a single-drive laptop or on a massive server, it'll store your petabytes with ragingly redundant RAID reliability, and it'll keep the bit-rotted bleeps and bloops out of your iTunes library.”
  47. Block Checksums
       S5FS: diskmap entries 0, 1, 2, …, 9, 10, 11, 12 point to Disk Blocks (1K bytes)
       ZFS:
         Block   Checksum (SHA-256)
         0       40a3dc…
         1       2c5829d…
         2       955d253…
         …       …
       How do you check the checksums?
  48. Hashing the Hashes
       Hash(B1)   Hash(B2)   Hash(B3)   Hash(B4)
       Block 1    Block 2    Block 3    Block 4
  49. Merkle Tree (Ralph Merkle)
       Hash(B1)   Hash(B2)   Hash(B3)   Hash(B4)
       Block 1    Block 2    Block 3    Block 4
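"Hashing the hashes" can be sketched in a few lines: hash each block, then repeatedly hash pairs of hashes until a single root remains. This is a generic Merkle-tree sketch using SHA-256 from Python's `hashlib`, not a description of ZFS's on-disk format; duplicating the last hash when a level has an odd number of entries is one common convention and is an assumption here.

```python
# Generic Merkle root over a list of data blocks (not ZFS's actual layout).
import hashlib

def sha256(data):
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Hash every block, then hash pairs of hashes up to a single root."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])    # assumed convention for odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"Block 1", b"Block 2", b"Block 3", b"Block 4"]
root = merkle_root(blocks)

# Flipping a single byte in any block changes the root.
corrupted = [b"Block 1", b"Block 2", b"Block X", b"Block 4"]
print(root.hex())
print(merkle_root(corrupted) != root)
```

Because the root depends on every block, checking one trusted root hash is enough to detect corruption anywhere beneath it.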
  50. Recovery
       One Copy vs. Copy 1, Copy 2 (copies = 2)
       Keep 2 copies of every block: if checksum fails for first copy read, try reading second copy.
  51. For the truly paranoid…
       One Copy vs. Copy 1, Copy 2, Copy 3 (copies = 3)
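The recovery rule above (try the next copy when a checksum fails) is simple to express in code. A minimal sketch, with in-memory byte strings standing in for disk reads; `read_block` and its arguments are hypothetical names, and SHA-256 is used to match the checksum slide.

```python
# Read a block from redundant copies, falling back when a checksum fails.
import hashlib

def read_block(copies, expected_checksum):
    """Return the first copy whose SHA-256 matches the stored checksum."""
    for data in copies:
        if hashlib.sha256(data).hexdigest() == expected_checksum:
            return data
    raise IOError("every copy failed its checksum")

good = b"important data"
checksum = hashlib.sha256(good).hexdigest()   # stored separately from the block

bit_rotted = b"important dat\x00"
print(read_block([bit_rotted, good], checksum))   # copies = 2: second copy saves us
```

With copies = 3 the same loop simply has one more chance before giving up.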
  52. For the fairly paranoid but cheap…
       RAID: Redundant Arrays of Inexpensive Disks (ACM SIGMOD 1988)
       [image: whitehouse.gov]
  53. Case for RAID
  55. Redundancy
  57. Improving Performance
       Cache (64MB DRAM)
       Adaptive Replacement Cache
  58. Adaptive Replacement Cache
       Blocks in Cache:
         T1: Recent Cache Entries
         T2: Frequently-Used Blocks (Accessed Again)
       “Ghost” Entries:
         B1: Evicted from T1 (LRU)
         B2: Evicted from T2 (LRU)
       Size of T1 adapts.
       How should relative size of T1 and T2 be adjusted?
  59. Adaptive Replacement Cache
       Blocks in Cache:
         T1: Recent Cache Entries
         T2: Frequently-Used Blocks (Accessed Again)
       “Ghost” Entries:
         B1: Evicted from T1 (LRU)
         B2: Evicted from T2 (LRU)
       Size of T1 adapts.
       Hit in B1: should increase size of T1, drop entry from T2 to B2
       Hit in B2: should increase size of T2, drop entry from T1 to B1
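The adaptation rule on the last slide can be sketched as a small cache simulator. This is a simplified illustration that follows the T1/T2/B1/B2 structure above, not a production implementation: the class name `ARC`, the variable `p` (the target size of T1), and the use of integer division for the adjustment step are all assumptions made to keep the sketch short.

```python
# Simplified Adaptive Replacement Cache sketch (names and step size are assumptions).
from collections import OrderedDict

class ARC:
    def __init__(self, capacity):
        self.c = capacity          # number of blocks the cache can hold
        self.p = 0                 # target size of T1; adapts on ghost hits
        self.t1 = OrderedDict()    # recent entries (seen once), LRU first
        self.t2 = OrderedDict()    # frequently-used entries (seen again)
        self.b1 = OrderedDict()    # ghosts: keys evicted from T1
        self.b2 = OrderedDict()    # ghosts: keys evicted from T2

    def _replace(self, key):
        """Evict one cached block into the matching ghost list."""
        if self.t1 and (len(self.t1) > self.p or
                        (key in self.b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, key):
        """Touch a block; return True on a cache hit."""
        if key in self.t1:                       # accessed again: promote to T2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                       # ghost hit in B1: grow T1's target
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                       # ghost hit in B2: shrink T1's target
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = None
            return False
        l1 = len(self.t1) + len(self.b1)         # completely new key
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(key)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(key)
        self.t1[key] = None
        return False

cache = ARC(capacity=2)
hits = [cache.access(k) for k in (1, 2, 1, 3, 2)]
print(hits, "target size of T1:", cache.p)
```

In this short trace, block 2 is evicted from T1 into B1 and then requested again, so the ghost hit raises the T1 target and pushes block 1 from T2 into B2, matching the "Hit in B1" rule.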
  60. IBM Almaden Research Center
  61. Do you actually have a disk like this on your main computing device? Cache (64MB DRAM)
  62. Flash Memory: Solid State Drive
  63. Storage Systems
       Device                     Example                                        Time to Access        Cost per Bit
       Mercury (Gin) Delay Line   UNIVAC (1951)                                  220,000ns (average)   $0.38 (1968) (a bazillion n$)
       DRAM                       Kingston KVR16N11/4 4GB DDR3 ($40)             13.75ns               1.16 n$
       Hard Drive                 Seagate Desktop HDD 4 TB SATA 6Gb/s NCQ 64MB   5,000,000ns           0.0046 n$
       SSD                        Samsung 500GB ($300)                           ?                     0.075 n$
  67. Storage Systems
       Device                             Example                                        Time to Access                 Cost per Bit
       Mercury (Gin) Delay Line           UNIVAC (1951)                                  220,000ns (average)            $0.38 (1968) (a bazillion n$)
       DRAM                               Kingston KVR16N11/4 4GB DDR3 ($40)             13.75ns                        1.16 n$
       SSD                                Samsung 500GB ($300)                           ~10,000 ns (for random read)   0.075 n$
       Disk Drive (Modern Hard Drive)     Seagate Desktop HDD 4 TB SATA 6Gb/s NCQ 64MB   5,000,000ns                    0.0046 n$
  68. Storage systems should be designed around hardware capabilities and workload.
       Today’s OSes mostly use filesystems designed around 1990s disks and 1960s workloads!
       But, with lots of clever hacks to make them work okay on today’s hardware and workloads.
       Charge: More from Wilkes 1967:
