In-memory Caching in HDFS: Lower Latency, Same Great Taste
Andrew Wang | awang@cloudera.com
Colin McCabe | cmccabe@cloudera.com
Cloudera
[Motivating diagrams: Alice submits a query to the Hadoop cluster and gets back a result set; fresh data keeps arriving; a rollup job scans the same data she is querying.]
Problems
• Data hotspots
  • Everyone wants to query some fresh data
  • Shared disks are unable to handle high load
• Mixed workloads
  • Data analyst making small point queries
  • Rollup job scanning all the data
  • Point query latency suffers because of I/O contention
• Same theme: disk I/O contention!
How do we solve I/O issues?
• Cache important datasets in memory!
  • Much higher throughput than disk
  • Fast random/concurrent access
• Interesting working sets often fit in cluster memory
  • Traces from Facebook’s Hive cluster
• Increasingly affordable to buy a lot of memory
  • Moore’s law
  • 1TB RAM server is 40k on HP’s website
[Diagrams: the OS page cache can serve Alice’s repeated query, but the rollup job evicts her working set; even page-cache hits pay for extra copies and checksum verification.]
Design Considerations
1. Placing tasks for memory locality
  • Expose cache locations to application schedulers
2. Contention for page cache from other users
  • Explicitly pin hot datasets in memory
3. Extra copies when reading cached data
  • Zero-copy API to read cached data
Outline
• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API
• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala
• Future work
Cache Directives
• A cache directive describes a file or directory that should be cached (see the API sketch below)
  • Path
  • Cache replication factor: 1-N
• Stored permanently on the NameNode
• Also have cache pools for access control and quotas, but we won’t be covering that here
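The deck does not show the directive API itself. As a hedged sketch (not code from the talk), directives can be created through the public HDFS client API; the pool name, path, and replication factor below are illustrative, and the same operations are also exposed by the hdfs cacheadmin command-line tool.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class CacheFooExample {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // Cache pools group directives for access control and quotas.
    dfs.addCachePool(new CachePoolInfo("analytics"));

    // Ask the NameNode to keep /foo cached on two DataNodes.
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/foo"))
            .setPool("analytics")
            .setReplication((short) 2)
            .build());
    System.out.println("Added cache directive " + directiveId);
  }
}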
Architecture
[Diagram: a DFSClient asks the NameNode to cache /foo; the NameNode and DataNodes exchange cache commands and cache heartbeats; DFSClients call getBlockLocations to find cached replicas.]
mlock
• The DataNode pins each cached block into the page cache using mlock, and checksums it (rough sketch below)
• Because we’re using the page cache, the blocks don’t take up any space on the Java heap
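The talk does not include this code; purely as a rough illustration (not the DataNode’s actual implementation, which goes through Hadoop’s native I/O layer), a block file can be memory-mapped and pinned roughly like this. The mlock binding is declared as a hypothetical native stub, since the JDK exposes no public mlock API.

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PinBlockSketch {
  // Hypothetical JNI binding for mlock(2); not a real JDK or Hadoop API.
  private static native void mlock(MappedByteBuffer buf, long len);

  public static MappedByteBuffer pin(String blockFile) throws Exception {
    try (FileChannel chan = FileChannel.open(
        Paths.get(blockFile), StandardOpenOption.READ)) {
      // The mapping lives in the OS page cache, not on the Java heap.
      MappedByteBuffer block =
          chan.map(FileChannel.MapMode.READ_ONLY, 0, chan.size());
      mlock(block, chan.size());
      // Verify the block's checksums once here; later reads can then skip them.
      return block;
    }
  }
}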
Zero-copy read API
• Clients can use the zero-copy read API to map the cached replica into their own address space
• The zero-copy API avoids the overhead of the read() and pread() system calls
• However, we don’t verify checksums when using the zero-copy API
• The zero-copy API can only be used on cached data, or when the application computes its own checksums
Zero-copy read API
New FSDataInputStream methods (usage sketch below):

ByteBuffer read(ByteBufferPool pool,
    int maxLength, EnumSet<ReadOption> opts);
void releaseBuffer(ByteBuffer buffer);
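A minimal usage sketch of these methods, with an illustrative path and buffer size. ElasticByteBufferPool is a stock ByteBufferPool implementation; per the slide above, passing ReadOption.SKIP_CHECKSUMS means the application takes responsibility for integrity, which is what allows zero-copy mapped reads.

import java.nio.ByteBuffer;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ByteBufferPool;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZcrExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    ByteBufferPool pool = new ElasticByteBufferPool();
    try (FSDataInputStream in = fs.open(new Path("/foo/part-00000"))) {
      ByteBuffer buf;
      // read() hands back a (possibly memory-mapped) buffer, or null at EOF.
      while ((buf = in.read(pool, 8 * 1024 * 1024,
                            EnumSet.of(ReadOption.SKIP_CHECKSUMS))) != null) {
        process(buf);           // consume buf.remaining() bytes
        in.releaseBuffer(buf);  // always hand the buffer back
      }
    }
  }

  private static void process(ByteBuffer buf) {
    // application-specific work over the mapped bytes
  }
}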
Skipping Checksums
• We would like to skip checksum verification when reading cached data
  • DataNode already checksums when caching the block
  • Enables more efficient SCR, ZCR
• Requirements
  • Client needs to know that the replica is cached
  • DataNode needs to notify the client if the replica is uncached
Skipping Checksums
• The DataNode and DFSClient use shared memory segments to communicate which blocks are cached
[Diagram: the DFSClient reads from the DataNode’s page cache; a shared memory segment between the two tracks which blocks are mlocked.]
Skipping Checksums
[Diagram (two build slides): a shared-memory slot for block 123 with “Can Skip Csums” and “In Use” flags, visible to both the DataNode and the DFSClient’s zero-copy MappedByteBuffer.]
Architecture Summary
• The Cache Directive API provides per-file control over what is cached
• The NameNode tracks cached blocks and coordinates DataNode cache work
• The DataNodes use mlock to lock page cache blocks into memory
• The DFSClient can determine whether it is safe to skip checksums via the shared memory segment
• Caching makes it possible to use the efficient Zero-Copy API on cached data
Test Node
• 48GB of RAM
• Configured 38GB of HDFS cache
• 11x SATA hard disks
• 2x4 core 2.13 GHz Westmere Xeon processors
• 10 Gbit/s full-bisection bandwidth network
Single-Node Microbenchmarks
• How much faster are cached and zero-copy reads?
• Introducing vecsum (vector sum) — rough Java sketch below
  • Computes sums of a file of doubles
  • Highly optimized: uses SSE intrinsics
  • libhdfs program
  • Can toggle between various read methods
• Terminology
  • SCR: short-circuit reads
  • ZCR: zero-copy reads
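vecsum itself is a C libhdfs program; purely for illustration, a Java stand-in for its kernel, applied to a buffer returned by the zero-copy read loop shown earlier, might look like this (it assumes the file holds little-endian doubles and that buffers come back 8-byte aligned, which a real implementation would not rely on):

// Not the real benchmark: a simplified Java stand-in for vecsum's inner loop.
static double sumDoubles(java.nio.ByteBuffer buf) {
  java.nio.DoubleBuffer doubles =
      buf.order(java.nio.ByteOrder.LITTLE_ENDIAN).asDoubleBuffer();
  double sum = 0;
  while (doubles.hasRemaining()) {
    sum += doubles.get();
  }
  return sum;
}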
Throughput Reading 1G File 20x
[Chart, GB/s: TCP 0.8, TCP no csums 0.9, SCR 1.9, SCR no csums 2.4, ZCR 5.9]
ZCR 1GB vs 20GB
[Chart, GB/s: 1GB 5.9, 20GB 2.7]
Throughput
• Skipping checksums matters more when going faster
• ZCR gets close to bus bandwidth
  • ~6GB/s
• Need to reuse client-side mmaps for maximum perf
  • page_fault function is 1.16% of cycles in 1G
  • 17.55% in 20G
Client CPU cycles
[Chart, CPU cycles (billions): TCP 57.6, TCP no csums 51.8, SCR 27.1, SCR no csums 23.4, ZCR 12.7]
Why is ZCR more CPU-efficient?
[Diagram slides: the zero-copy path avoids the extra copy made by normal reads.]
Remote Cached vs. Local Uncached
• Zero-copy is only possible for local cached data
• Is it better to read from remote cache, or local disk?
Remote Cached vs. Local Uncached
[Chart, MB/s: TCP 841, iperf 1092, SCR 125, dd 137]
Microbenchmark Conclusions
• Short-circuit reads need less CPU than TCP reads
• ZCR is even more efficient, because it avoids a copy
• ZCR goes much faster when re-reading the same data, because it can avoid mmap page faults
• Network and disk may be bottleneck for remote or uncached reads
MapReduce
• Started with example MR jobs
  • Wordcount
  • Grep
• 5 node cluster: 4 DNs, 1 NN
  • Same hardware configuration as single node tests
  • 38GB HDFS cache per DN
  • 11 disks per DN
• 17GB of Wikipedia text
  • Small enough to fit into cache at 3x replication
• Ran each job 10 times, took the average
wordcount and grep
[Chart: end-to-end job times in seconds (y-axis 0-400) for wordcount, wordcount cached, grep, and grep cached; bar values 275, 52, 280, 55]
• Almost no speedup!
• ~60MB/s vs. ~330MB/s map throughput: not I/O bound
wordcount and grep
• End-to-end latency barely changes
• These MR jobs are simply not I/O bound!
  • Best map phase throughput was about 330MB/s
  • 44 disks can theoretically do 4400MB/s
• Further reasoning
  • Long JVM startup and initialization time
  • Many copies in TextInputFormat, doesn’t use zero-copy
  • Caching input data doesn’t help reduce step
Introducing bytecount
• Trivial version of wordcount
  • Counts # of occurrences of byte values
• Heavily CPU optimized
  • Each mapper processes an entire block via ZCR
  • No additional copies
  • No record slop across block boundaries
  • Fast inner loop (sketch below)
• Very unrealistic job, but serves as a best case
• Also tried 2GB block size to amortize startup costs
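The talk does not include bytecount’s source; as a simplified illustration of the per-block counting loop over a zero-copy buffer (the real job is far more heavily optimized):

// Illustrative only: tally byte values in a buffer obtained via a zero-copy
// read of a full block, as in the earlier read-loop sketch.
static void countBytes(java.nio.ByteBuffer block, long[] counts /* 256 entries */) {
  while (block.hasRemaining()) {
    counts[block.get() & 0xff]++;
  }
}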
bytecount
[Chart: job times in seconds (y-axis 0-70); bar values 52, 39, 35, 55, 45, 58]
• 1.3x faster
• Still only ~500MB/s
MapReduce Conclusions
• Many MR jobs will see marginal improvement
  • Startup costs
  • CPU inefficiencies
  • Shuffle and reduce steps
• Even bytecount sees only modest gains
  • 1.3x faster than disk
  • 500MB/s with caching and ZCR
  • Nowhere close to GB/s possible with memory
• Needs more work to take full advantage of caching!
Impala Benchmarks
• Open-source OLAP database developed by Cloudera
• Tested with Impala 1.3 (CDH 5)
• Same 4 DN cluster as MR section
  • 38GB of 48GB per DN configured as HDFS cache
  • 152GB aggregate HDFS cache
  • 11 disks per DN
Impala Benchmarks
• 1TB TPC-DS store_sales table, text format
• count(*) on different numbers of partitions
  • Has to scan all the data, no skipping
• Queries
  • 51GB small query (34% cache capacity)
  • 148GB big query (98% cache capacity)
  • Small query with concurrent workload
• Tested “cold” and “hot”
  • echo 3 > /proc/sys/vm/drop_caches
  • Lets us compare HDFS caching against page cache
Small Query
[Chart, average response time (s): Uncached cold 19.8, Cached cold 5.8, Uncached hot 4.0, Cached hot 3.0]
• 2550 MB/s uncached cold vs. 17 GB/s cached: I/O bound!
• 3.4x faster, disk vs. memory
• 1.3x after warmup, still wins on CPU efficiency
Big Query
[Chart, average response time (s): Uncached cold 48.2, Cached cold 11.5, Uncached hot 40.9, Cached hot 9.4]
• 4.2x faster, disk vs. mem
• 4.3x faster even when hot: the big query doesn’t fit in the page cache, and we cannot schedule for page cache locality
Small Query with Concurrent Workload
[Chart, average response time (s): Uncached, Cached, Cached (not concurrent)]
• 7x faster when small query working set is cached
• 2x slower than isolated, CPU contention
Impala Conclusions
• HDFS cache is faster than disk or page cache
  • ZCR is more efficient than SCR from page cache
  • Better when working set is approx. cluster memory
  • Can schedule tasks for cache locality
• Significantly better for concurrent workloads
  • 7x faster when contending with a single background query
• Impala performance will only improve
  • Many CPU improvements on the roadmap
Future Work
• Automatic cache replacement
  • LRU, LFU, ?
• Sub-block caching
  • Potentially important for automatic cache replacement
  • Columns in Parquet
• Compression, encryption, serialization
  • Lose many benefits of zero-copy API
• Write-side caching
  • Enables Spark-like RDDs for all HDFS applications
Conclusion
• I/O contention is a problem for concurrent workloads
• HDFS can now explicitly pin working sets into RAM
• Applications can place their tasks for cache locality
• Use zero-copy API to efficiently read cached data
• Substantial performance improvements
  • 6GB/s for single thread microbenchmark
  • 7x faster for concurrent Impala workload
bytecount (backup slide)
[Chart: bytecount job times in seconds (y-axis 0-70); bar values 52, 39, 35, 55, 45, 58]
• Less disk parallelism