In-memory Caching in HDFS: Lower Latency, Same Great Taste
 

    Presentation Transcript

    • In-memory Caching in HDFS: Lower Latency, Same Great Taste Andrew Wang and Colin McCabe Cloudera
    • 2 In-memory Caching in HDFS Lower latency, same great taste Andrew Wang | awang@cloudera.com Colin McCabe | cmccabe@cloudera.com
    • (Diagram slides: Alice queries the Hadoop cluster and receives a result set; fresh data arrives; a rollup job scans the same data.)
    • Problems • Data hotspots • Everyone wants to query some fresh data • Shared disks are unable to handle high load • Mixed workloads • Data analyst making small point queries • Rollup job scanning all the data • Point query latency suffers because of I/O contention • Same theme: disk I/O contention! 7
    • How do we solve I/O issues? • Cache important datasets in memory! • Much higher throughput than disk • Fast random/concurrent access • Interesting working sets often fit in cluster memory • Traces from Facebook’s Hive cluster • Increasingly affordable to buy a lot of memory • Moore’s law • A 1TB RAM server is about $40k on HP’s website 8
    • (Diagram slides: the OS page cache can serve Alice’s repeated query, but the rollup job contends for the page cache, and reads still incur extra copies and checksum verification.)
    • Design Considerations 1. Placing tasks for memory locality • Expose cache locations to application schedulers 2. Contention for page cache from other users • Explicitly pin hot datasets in memory 3. Extra copies when reading cached data • Zero-copy API to read cached data 14
    • Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 15
    • Cache Directives • A cache directive describes a file or directory that should be cached • Path • Cache replication factor: 1-N • Stored permanently on the NameNode • Also have cache pools for access control and quotas, but we won’t be covering that here 17
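    For reference, a minimal sketch of how an application could add a cache directive through the HDFS Java API (the path /hot/dataset and the pool name analyst-pool are made-up examples, and error handling is omitted):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hdfs.DistributedFileSystem;
      import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

      public class AddCacheDirective {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Assumes fs.defaultFS points at an HDFS cluster with caching configured.
          DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

          // Describe what to cache: a path, how many cached replicas, and a pool.
          CacheDirectiveInfo directive = new CacheDirectiveInfo.Builder()
              .setPath(new Path("/hot/dataset"))   // hypothetical path
              .setReplication((short) 1)           // cache 1 replica of each block
              .setPool("analyst-pool")             // hypothetical cache pool
              .build();

          // The NameNode stores the directive and instructs DataNodes to cache the blocks.
          long id = dfs.addCacheDirective(directive);
          System.out.println("Added cache directive with id " + id);
        }
      }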
    • Architecture 18 (Diagram: a DFSClient asks the NameNode to cache /foo; the NameNode sends cache commands to the DataNodes, which report back via cache heartbeats; DFSClients then discover cached replicas through getBlockLocations.)
    • mlock • The DataNode pins each cached block into the page cache using mlock, and checksums it. • Because we’re using the page cache, the blocks don’t take up any space on the Java heap. 19 (Diagram: the DFSClient reads a block the DataNode has mlocked into the page cache.)
    • Zero-copy read API • Clients can use the zero-copy read API to map the cached replica into their own address space • The zero-copy API avoids the overhead of the read() and pread() system calls • However, we don’t verify checksums when using the zero-copy API • The zero-copy API can only be used on cached data, or when the application computes its own checksums. 20
    • Zero-copy read API New FSDataInputStream methods: ByteBuffer read(ByteBufferPool pool, int maxLength, EnumSet<ReadOption> opts); void releaseBuffer(ByteBuffer buffer); 21
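    To make the API concrete, here is a rough usage sketch (the file path is a placeholder, and the byte tally stands in for real processing); the buffer pool supplies fallback buffers, and SKIP_CHECKSUMS lets the read skip checksum verification on cached replicas:

      import java.nio.ByteBuffer;
      import java.util.EnumSet;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.ReadOption;
      import org.apache.hadoop.io.ElasticByteBufferPool;

      public class ZeroCopyReadExample {
        public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          ElasticByteBufferPool pool = new ElasticByteBufferPool();
          long total = 0;
          try (FSDataInputStream in = fs.open(new Path("/hot/dataset/part-0"))) {  // hypothetical path
            while (true) {
              // Returns a buffer that maps the cached replica when possible;
              // SKIP_CHECKSUMS allows the read to skip checksum verification.
              ByteBuffer buf = in.read(pool, 4 * 1024 * 1024,
                  EnumSet.of(ReadOption.SKIP_CHECKSUMS));
              if (buf == null) {
                break;                     // end of file
              }
              total += buf.remaining();    // stand-in for real processing
              in.releaseBuffer(buf);       // always return the buffer
            }
          }
          System.out.println("Read " + total + " bytes");
        }
      }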
    • Skipping Checksums • We would like to skip checksum verification when reading cached data • DataNode already checksums when caching the block • Enables more efficient SCR, ZCR • Requirements • Client needs to know that the replica is cached • DataNode needs to notify the client if the replica is uncached 22
    • Skipping Checksums • The DataNode and DFSClient use shared memory segments to communicate which blocks are cached. 23 (Diagram: the shared memory segment sits between the DataNode’s mlocked page cache block and the DFSClient’s read path.)
    • Skipping Checksums 24-25 (Diagrams: for block 123, the shared memory segment carries “can skip checksums” and “in use” flags that govern the client’s zero-copy MappedByteBuffer.)
    • Architecture Summary • The Cache Directive API provides per-file control over what is cached • The NameNode tracks cached blocks and coordinates DataNode cache work • The DataNodes use mlock to lock page cache blocks into memory • The DFSClient can determine whether it is safe to skip checksums via the shared memory segment • Caching makes it possible to use the efficient Zero-Copy API on cached data 26
    • Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Single-Node Microbenchmarks • MapReduce • Impala • Future work 27
    • Test Node • 48GB of RAM • Configured 38GB of HDFS cache • 11x SATA hard disks • 2x4 core 2.13 GHz Westmere Xeon processors • 10 Gbit/s full-bisection bandwidth network 28
    • Single-Node Microbenchmarks • How much faster are cached and zero-copy reads? • Introducing vecsum (vector sum) • Computes sums of a file of doubles • Highly optimized: uses SSE intrinsics • libhdfs program • Can toggle between various read methods • Terminology • SCR: short-circuit reads • ZCR: zero-copy reads 29
    • Throughput Reading 1G File 20x 30 (Bar chart, GB/s: TCP 0.8, TCP no csums 0.9, SCR 1.9, SCR no csums 2.4, ZCR 5.9)
    • ZCR 1GB vs 20GB 31 (Bar chart, GB/s: 1GB working set 5.9, 20GB working set 2.7)
    • Throughput • Skipping checksums matters more when going faster • ZCR gets close to bus bandwidth • ~6GB/s • Need to reuse client-side mmaps for maximum perf • page_fault function is 1.16% of cycles in 1G • 17.55% in 20G 32
    • Client CPU cycles 33 (Bar chart, CPU cycles in billions: TCP 57.6, TCP no csums 51.8, SCR 27.1, SCR no csums 23.4, ZCR 12.7)
    • Why is ZCR more CPU-efficient? 34-35 (Diagram slides.)
    • Remote Cached vs. Local Uncached • Zero-copy is only possible for local cached data • Is it better to read from remote cache, or local disk? 36
    • Remote Cached vs. Local Uncached 37 (Bar chart, MB/s: TCP 841, iperf 1092, SCR 125, dd 137)
    • Microbenchmark Conclusions • Short-circuit reads need less CPU than TCP reads • ZCR is even more efficient, because it avoids a copy • ZCR goes much faster when re-reading the same data, because it can avoid mmap page faults • Network and disk may be bottleneck for remote or uncached reads 38
    • Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 39
    • MapReduce • Started with example MR jobs • Wordcount • Grep • 5 node cluster: 4 DNs, 1 NN • Same hardware configuration as single node tests • 38GB HDFS cache per DN • 11 disks per DN • 17GB of Wikipedia text • Small enough to fit into cache at 3x replication • Ran each job 10 times, took the average 40
    • wordcount and grep 41-43 (Bar chart, job time in seconds: wordcount ~275-280 and grep ~52-55, cached vs. uncached — almost no speedup from caching. Map-phase throughput was only ~60MB/s for wordcount and ~330MB/s for grep, so these jobs are not I/O bound.)
    • wordcount and grep • End-to-end latency barely changes • These MR jobs are simply not I/O bound! • Best map phase throughput was about 330MB/s • 44 disks can theoretically do 4400MB/s • Further reasoning • Long JVM startup and initialization time • Many copies in TextInputFormat, doesn’t use zero-copy • Caching input data doesn’t help reduce step 44
    • Introducing bytecount • Trivial version of wordcount • Counts # of occurrences of byte values • Heavily CPU optimized • Each mapper processes an entire block via ZCR • No additional copies • No record slop across block boundaries • Fast inner loop • Very unrealistic job, but serves as a best case • Also tried 2GB block size to amortize startup costs 45
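    As a rough illustration of the kind of inner loop bytecount relies on (a sketch of the idea, not the authors’ actual job), a mapper could tally byte values straight out of the ByteBuffer returned by a zero-copy read:

      import java.nio.ByteBuffer;

      public class ByteCounter {
        // Count occurrences of each byte value in one HDFS block's worth of data.
        // The buffer would typically be the (possibly mmapped) result of a
        // zero-copy read, so there is no per-record parsing or extra copying.
        public static long[] countBytes(ByteBuffer block) {
          long[] counts = new long[256];
          for (int i = block.position(); i < block.limit(); i++) {
            counts[block.get(i) & 0xFF]++;   // absolute get: position is unchanged
          }
          return counts;
        }
      }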
    • bytecount 46-48 (Bar chart, job time in seconds: 52, 39, 35, 55, 45, 58 across the bytecount variants, cached and uncached, with normal and 2GB block sizes. Annotations: about 1.3x faster with caching, but still only ~500MB/s.)
    • MapReduce Conclusions 49 • Many MR jobs will see marginal improvement • Startup costs • CPU inefficiencies • Shuffle and reduce steps • Even bytecount sees only modest gains • 1.3x faster than disk • 500MB/s with caching and ZCR • Nowhere close to GB/s possible with memory • Needs more work to take full advantage of caching!
    • Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 50
    • Impala Benchmarks • Open-source OLAP database developed by Cloudera • Tested with Impala 1.3 (CDH 5) • Same 4 DN cluster as MR section • 38GB of 48GB per DN configured as HDFS cache • 152GB aggregate HDFS cache • 11 disks per DN 51
    • Impala Benchmarks • 1TB TPC-DS store_sales table, text format • count(*) on different numbers of partitions • Has to scan all the data, no skipping • Queries • 51GB small query (34% cache capacity) • 148GB big query (98% cache capacity) • Small query with concurrent workload • Tested “cold” and “hot” • echo 3 > /proc/sys/vm/drop_caches • Lets us compare HDFS caching against page cache 52
    • Small Query 53-56 (Bar chart, average response time in seconds: Uncached cold 19.8, Cached cold 5.8, Uncached hot 4.0, Cached hot 3.0. The uncached cold run is I/O bound at ~2550 MB/s, while the cached run reads at ~17 GB/s: 3.4x faster, disk vs. memory, and still 1.3x faster after warmup because caching wins on CPU efficiency.)
    • Big Query 57-59 (Bar chart, average response time in seconds: Uncached cold 48.2, Cached cold 11.5, Uncached hot 40.9, Cached hot 9.4. Caching is 4.2x faster than disk when cold, and 4.3x faster even when hot: the 148GB query doesn’t fit in the page cache, and Impala cannot schedule for page cache locality.)
    • Small Query with Concurrent Workload 60-62 (Bar chart, average response time in seconds for Uncached, Cached, and Cached (not concurrent). The query is 7x faster when its working set is cached, though still 2x slower than when run in isolation because of CPU contention.)
    • Impala Conclusions • HDFS cache is faster than disk or page cache • ZCR is more efficient than SCR from page cache • Better when working set is approx. cluster memory • Can schedule tasks for cache locality • Significantly better for concurrent workloads • 7x faster when contending with a single background query • Impala performance will only improve • Many CPU improvements on the roadmap 63
    • Outline • Implementation • NameNode and DataNode modifications • Zero-copy read API • Evaluation • Microbenchmarks • MapReduce • Impala • Future work 64
    • Future Work • Automatic cache replacement • LRU, LFU, ? • Sub-block caching • Potentially important for automatic cache replacement • Columns in Parquet • Compression, encryption, serialization • Lose many benefits of zero-copy API • Write-side caching • Enables Spark-like RDDs for all HDFS applications 65
    • Conclusion • I/O contention is a problem for concurrent workloads • HDFS can now explicitly pin working sets into RAM • Applications can place their tasks for cache locality • Use zero-copy API to efficiently read cached data • Substantial performance improvements • 6GB/s for single thread microbenchmark • 7x faster for concurrent Impala workload 66
    • bytecount 70 (Backup slide: the same bytecount chart as before, annotated “Less disk parallelism”.)