In-memory Caching in HDFS: Lower Latency, Same Great Taste
Presentation Transcript

  • 1. In-memory Caching in HDFS: Lower Latency, Same Great Taste Andrew Wang and Colin McCabe Cloudera
  • 2. In-memory Caching in HDFS: Lower latency, same great taste. Andrew Wang | awang@cloudera.com, Colin McCabe | cmccabe@cloudera.com
  • 3. Alice Hadoop cluster Query Result set
  • 4. Alice Fresh data
  • 5. Fresh data
  • 6. Alice Rollup
  • 7. Problems • Data hotspots • Everyone wants to query some fresh data • Shared disks are unable to handle high load • Mixed workloads • Data analyst making small point queries • Rollup job scanning all the data • Point query latency suffers because of I/O contention • Same theme: disk I/O contention! 7
  • 8. How do we solve I/O issues? • Cache important datasets in memory! • Much higher throughput than disk • Fast random/concurrent access • Interesting working sets often fit in cluster memory • Traces from Facebook’s Hive cluster • Increasingly affordable to buy a lot of memory • Moore’s law • A 1TB RAM server is about $40k on HP’s website
  • 9. Alice Page cache
  • 10. Alice Repeated query ?
  • 11. Alice Rollup
  • 12. Alice Extra copies
  • 13. Alice Checksum verification Extra copies
  • 14. Design Considerations 1. Placing tasks for memory locality • Expose cache locations to application schedulers 2. Contention for page cache from other users • Explicitly pin hot datasets in memory 3. Extra copies when reading cached data • Zero-copy API to read cached data 14
  • 15. Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 15
  • 16. Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 16
  • 17. Cache Directives • A cache directive describes a file or directory that should be cached • Path • Cache replication factor: 1-N • Stored permanently on the NameNode • Also have cache pools for access control and quotas, but we won’t be covering that here 17
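To make the directive API concrete, here is a minimal sketch of adding a directive programmatically through DistributedFileSystem (the same operation the `hdfs cacheadmin -addDirective` CLI performs). The path and pool name ("/warehouse/fresh_data", "hot") are hypothetical, and the cache pool is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

    public class AddCacheDirective {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Describe what to cache: a path, a pool (for access control and
        // quotas), and how many cached replicas to keep.
        CacheDirectiveInfo directive = new CacheDirectiveInfo.Builder()
            .setPath(new Path("/warehouse/fresh_data"))  // hypothetical path
            .setPool("hot")                              // hypothetical pool
            .setReplication((short) 3)
            .build();

        // The NameNode stores the directive persistently and returns its id.
        long id = dfs.addCacheDirective(directive);
        System.out.println("Added cache directive " + id);
      }
    }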
  • 18. Architecture [diagram: a DFSClient asks the NameNode to “Cache /foo”; the NameNode sends cache commands to the DataNodes via their cache heartbeats; another DFSClient finds cached replicas with getBlockLocations]
  • 19. mlock • The DataNode pins each cached block into the page cache using mlock, and checksums it. • Because we’re using the page cache, the blocks don’t take up any space on the Java heap. [diagram: DFSClient reads a block the DataNode has mlocked into the page cache]
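The pinning itself happens in native code: the DataNode mmaps the block file and calls mlock through JNI. As a rough pure-Java illustration of the mapping step only, the sketch below maps a hypothetical block-replica file and touches its pages with MappedByteBuffer.load(); note that load() merely faults pages into the page cache and does not pin them the way mlock does.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapBlockFile {
      public static void main(String[] args) throws Exception {
        // Hypothetical path to an on-disk block replica, for illustration only.
        String blockPath = "/data/1/dfs/dn/current/blk_1073741825";
        try (RandomAccessFile raf = new RandomAccessFile(blockPath, "r");
             FileChannel ch = raf.getChannel()) {
          // Map the block read-only and fault its pages into the page cache.
          // The real DataNode additionally mlocks the mapping via JNI and
          // verifies the block checksum, so readers can later skip verification.
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
          buf.load();
          System.out.println("Mapped " + ch.size() + " bytes off-heap");
        }
      }
    }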
  • 20. Zero-copy read API • Clients can use the zero-copy read API to map the cached replica into their own address space • The zero-copy API avoids the overhead of the read() and pread() system calls • However, we don’t verify checksums when using the zero-copy API • The zero-copy API can only be used on cached data, or when the application computes its own checksums
  • 21. Zero-copy read API New FSDataInputStream methods: ByteBuffer read(ByteBufferPool pool, int maxLength, EnumSet<ReadOption> opts); void releaseBuffer(ByteBuffer buffer); 21
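A hedged usage sketch of these methods: the client borrows buffers from a ByteBufferPool, asks to skip checksums (safe here because the DataNode already verified the block when caching it), and hands each buffer back when done. The input path is hypothetical; ElasticByteBufferPool and ReadOption.SKIP_CHECKSUMS are the stock Hadoop classes.

    import java.nio.ByteBuffer;
    import java.util.EnumSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.ReadOption;
    import org.apache.hadoop.io.ElasticByteBufferPool;

    public class ZeroCopyRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ElasticByteBufferPool pool = new ElasticByteBufferPool();
        // Hypothetical input path, for illustration.
        try (FSDataInputStream in = fs.open(new Path("/warehouse/fresh_data/part-00000"))) {
          long total = 0;
          while (true) {
            // Ask for up to 4MB; for a local cached replica the returned buffer
            // can map the DataNode's cached pages directly, with no copy.
            ByteBuffer buf = in.read(pool, 4 * 1024 * 1024,
                EnumSet.of(ReadOption.SKIP_CHECKSUMS));
            if (buf == null) break;        // end of file
            total += buf.remaining();      // process the bytes here
            in.releaseBuffer(buf);         // always hand the buffer back
          }
          System.out.println("Read " + total + " bytes");
        }
      }
    }

When zero-copy mapping is not possible, the call falls back to filling a buffer from the supplied pool, so the same loop works for cached and uncached data.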
  • 22. Skipping Checksums • We would like to skip checksum verification when reading cached data • DataNode already checksums when caching the block • Enables more efficient SCR, ZCR • Requirements • Client needs to know that the replica is cached • DataNode needs to notify the client if the replica is uncached 22
  • 23. Skipping Checksums • The DataNode and DFSClient use shared memory segments to communicate which blocks are cached. [diagram: DFSClient and DataNode sharing a memory segment alongside the mlocked page cache]
  • 24-25. Skipping Checksums [diagram: a slot in the shared memory segment for Block 123, with “Can Skip Csums” and “In Use” flags, referenced by the DFSClient’s zero-copy MappedByteBuffer]
  • 26. Architecture Summary • The Cache Directive API provides per-file control over what is cached • The NameNode tracks cached blocks and coordinates DataNode cache work • The DataNodes use mlock to lock page cache blocks into memory • The DFSClient can determine whether it is safe to skip checksums via the shared memory segment • Caching makes it possible to use the efficient Zero-Copy API on cached data
  • 27. Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Single-Node Microbenchmarks • MapReduce • Impala • Future work 27
  • 28. Test Node • 48GB of RAM • Configured 38GB of HDFS cache • 11x SATA hard disks • 2x4 core 2.13 GHz Westmere Xeon processors • 10 Gbit/s full-bisection bandwidth network 28
  • 29. Single-Node Microbenchmarks • How much faster are cached and zero-copy reads? • Introducing vecsum (vector sum) • Computes sums of a file of doubles • Highly optimized: uses SSE intrinsics • libhdfs program • Can toggle between various read methods • Terminology • SCR: short-circuit reads • ZCR: zero-copy reads 29
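vecsum itself is a C libhdfs program with an SSE inner loop, so it is not reproduced here; as a rough illustration of what it computes, the hypothetical Java sketch below sums a buffer of doubles such as one chunk returned by the zero-copy read API (the little-endian byte order is an assumption for illustration).

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.DoubleBuffer;

    public class VecSumSketch {
      // Sum one buffer of doubles, as vecsum does for each chunk it reads.
      static double sum(ByteBuffer chunk) {
        DoubleBuffer doubles = chunk.order(ByteOrder.LITTLE_ENDIAN).asDoubleBuffer();
        double total = 0;
        while (doubles.hasRemaining()) {
          total += doubles.get();
        }
        return total;
      }

      public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8 * 1024).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 1024; i++) {
          buf.putDouble(0.5);
        }
        buf.flip();
        System.out.println("sum = " + sum(buf));  // expect 512.0
      }
    }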
  • 30. Throughput Reading 1G File 20x [bar chart, GB/s: TCP 0.8, TCP no csums 0.9, SCR 1.9, SCR no csums 2.4, ZCR 5.9]
  • 31. ZCR 1GB vs 20GB [bar chart, GB/s: 1GB 5.9, 20GB 2.7]
  • 32. Throughput • Skipping checksums matters more when going faster • ZCR gets close to bus bandwidth • ~6GB/s • Need to reuse client-side mmaps for maximum perf • page_fault function is 1.16% of cycles in 1G • 17.55% in 20G 32
  • 33. Client CPU cycles [bar chart, CPU cycles in billions: TCP 57.6, TCP no csums 51.8, SCR 27.1, SCR no csums 23.4, ZCR 12.7]
  • 34-35. Why is ZCR more CPU-efficient? [diagrams]
  • 36. Remote Cached vs. Local Uncached • Zero-copy is only possible for local cached data • Is it better to read from remote cache, or local disk? 36
  • 37. Remote Cached vs. Local Uncached [bar chart, MB/s: TCP 841, iperf 1092, SCR 125, dd 137]
  • 38. Microbenchmark Conclusions • Short-circuit reads need less CPU than TCP reads • ZCR is even more efficient, because it avoids a copy • ZCR goes much faster when re-reading the same data, because it can avoid mmap page faults • Network and disk may be bottleneck for remote or uncached reads 38
  • 39. Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 39
  • 40. MapReduce • Started with example MR jobs • Wordcount • Grep • 5 node cluster: 4 DNs, 1 NN • Same hardware configuration as single node tests • 38GB HDFS cache per DN • 11 disks per DN • 17GB of Wikipedia text • Small enough to fit into cache at 3x replication • Ran each job 10 times, took the average 40
  • 41. wordcount and grep [bar chart, job time in seconds: wordcount, wordcount cached, grep, grep cached; bar labels 275, 52, 280, 55]
  • 42. wordcount and grep [same chart] Almost no speedup!
  • 43. wordcount and grep [same chart, annotated ~60MB/s and ~330MB/s] Not I/O bound
  • 44. wordcount and grep • End-to-end latency barely changes • These MR jobs are simply not I/O bound! • Best map phase throughput was about 330MB/s • 44 disks can theoretically do 4400MB/s • Further reasoning • Long JVM startup and initialization time • Many copies in TextInputFormat, doesn’t use zero-copy • Caching input data doesn’t help the reduce step
  • 45. Introducing bytecount • Trivial version of wordcount • Counts # of occurrences of byte values • Heavily CPU optimized • Each mapper processes an entire block via ZCR • No additional copies • No record slop across block boundaries • Fast inner loop • Very unrealistic job, but serves as a best case • Also tried 2GB block size to amortize startup costs 45
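The bytecount mapper is not shown in the deck; a hypothetical sketch of its inner loop might look like the following, tallying byte values straight out of a buffer (for example one obtained with the zero-copy read API) with no per-record parsing.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class ByteCountSketch {
      // Tally occurrences of each byte value in one buffer; a real job would
      // run this per block inside a mapper and merge the 256 counters in the
      // reduce step.
      static void accumulate(ByteBuffer buf, long[] counts) {
        while (buf.hasRemaining()) {
          counts[buf.get() & 0xff]++;
        }
      }

      public static void main(String[] args) {
        long[] counts = new long[256];
        accumulate(ByteBuffer.wrap("hello hdfs".getBytes(StandardCharsets.UTF_8)), counts);
        System.out.println("'h' occurs " + counts['h'] + " times");  // 2
      }
    }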
  • 46. bytecount [bar chart, job time in seconds; bar labels 52, 39, 35, 55, 45, 58]
  • 47. bytecount [same chart] 1.3x faster
  • 48. bytecount [same chart] Still only ~500MB/s
  • 49. MapReduce Conclusions 49 • Many MR jobs will see marginal improvement • Startup costs • CPU inefficiencies • Shuffle and reduce steps • Even bytecount sees only modest gains • 1.3x faster than disk • 500MB/s with caching and ZCR • Nowhere close to GB/s possible with memory • Needs more work to take full advantage of caching!
  • 50. Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 50
  • 51. Impala Benchmarks • Open-source OLAP database developed by Cloudera • Tested with Impala 1.3 (CDH 5) • Same 4 DN cluster as MR section • 38GB of 48GB per DN configured as HDFS cache • 152GB aggregate HDFS cache • 11 disks per DN 51
  • 52. Impala Benchmarks • 1TB TPC-DS store_sales table, text format • count(*) on different numbers of partitions • Has to scan all the data, no skipping • Queries • 51GB small query (34% cache capacity) • 148GB big query (98% cache capacity) • Small query with concurrent workload • Tested “cold” and “hot” • echo 3 > /proc/sys/vm/drop_caches • Lets us compare HDFS caching against page cache 52
  • 53. Small Query [bar chart, average response time in seconds: Uncached cold 19.8, Cached cold 5.8, Uncached hot 4.0, Cached hot 3.0]
  • 54. Small Query [same chart] 2550 MB/s (uncached cold) vs 17 GB/s (cached hot): I/O bound!
  • 55. Small Query [same chart] 3.4x faster, disk vs. memory
  • 56. Small Query [same chart] 1.3x after warmup, still wins on CPU efficiency
  • 57. Big Query [bar chart, average response time in seconds: Uncached cold 48.2, Cached cold 11.5, Uncached hot 40.9, Cached hot 9.4]
  • 58. Big Query [same chart] 4.2x faster, disk vs. mem
  • 59. Big Query [same chart] 4.3x faster hot: doesn’t fit in page cache, and cannot schedule for page cache locality
  • 60. Small Query with Concurrent Workload [bar chart, average response time in seconds: Uncached, Cached, Cached (not concurrent)]
  • 61. Small Query with Concurrent Workload [same chart] 7x faster when small query working set is cached
  • 62. Small Query with Concurrent Workload [same chart] 2x slower than isolated, CPU contention
  • 63. Impala Conclusions • HDFS cache is faster than disk or page cache • ZCR is more efficient than SCR from page cache • Better when working set is approx. cluster memory • Can schedule tasks for cache locality • Significantly better for concurrent workloads • 7x faster when contending with a single background query • Impala performance will only improve • Many CPU improvements on the roadmap 63
  • 64. Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 64
  • 65. Future Work • Automatic cache replacement • LRU, LFU, ? • Sub-block caching • Potentially important for automatic cache replacement • Columns in Parquet • Compression, encryption, serialization • Lose many benefits of zero-copy API • Write-side caching • Enables Spark-like RDDs for all HDFS applications 65
  • 66. Conclusion • I/O contention is a problem for concurrent workloads • HDFS can now explicitly pin working sets into RAM • Applications can place their tasks for cache locality • Use zero-copy API to efficiently read cached data • Substantial performance improvements • 6GB/s for single thread microbenchmark • 7x faster for concurrent Impala workload 66
  • 67. bytecount [backup: same chart as slide 46] Less disk parallelism