Putting Wings on the Elephant


  1. Putting Wings on the Elephant. Pritam Damania, Facebook, Inc.
  2. Putting Wings on the Elephant! Pritam Damania, Software Engineer. April 2, 2014
  3. Agenda: 1 Background, 2 Major Issues in the I/O Path, 3 Read Improvements, 4 Write Improvements, 5 Lessons Learnt
  4. High-Level Messages Architecture (diagram): the Application Server writes messages to HBase and receives acks.
  5. HBase Cluster Physical Layout: multiple clusters/cells for messaging; 20 servers per rack, 5 or more racks per cluster. Each rack runs a ZooKeeper peer and one control role (HDFS Namenode, Standby Namenode, Job Tracker, HBase Master, or Backup HBase Master); the other 19 servers each run a Region Server, Data Node, and Task Tracker.
  6. Write Path Overview (diagram): the RegionServer writes to the Write-Ahead Log on HDFS and to the Memstore, which is flushed to HFiles.
  7. HDFS Write Pipeline (diagram): the RegionServer sends 64 KB packets through a pipeline of three Datanodes; each writes to the OS page cache in front of its disk, and an ack flows back.
  8. Read Path Overview (diagram): a Get is served by the RegionServer from the Memstore and from HFiles on HDFS.
  9. Problems in the R/W Path: skewed disk usage, high disk IOPS, high p99 latency for reads and writes.
  10. Improvements in the Read Path
  11. Disk Skew (diagram: three Datanodes, each with an OS page cache and a disk): the HDFS block size is 256 MB and an HDFS block resides on a single disk, so an fsync of 256 MB hits a single disk.
  12. Disk Skew: sync_file_range (diagram: a block file written to the Linux filesystem as 64 KB packets, with sync_file_range every 1 MB and a final fsync). sync_file_range(SYNC_FILE_RANGE_WRITE) initiates asynchronous writeback (sketch below).
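
To make the mechanism on slide 12 concrete, here is a minimal C sketch, assuming Linux and glibc; the file name, packet size, and sync interval are illustrative, not the datanode's actual code. It writes a block file in 64 KB packets and calls sync_file_range(SYNC_FILE_RANGE_WRITE) every 1 MB so writeback starts incrementally instead of as one 256 MB burst at fsync time.

```c
/* Sketch: incremental writeback with sync_file_range, assuming Linux + glibc.
 * Not the HDFS datanode code; file name and sizes are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PACKET_SIZE   (64 * 1024)         /* 64 KB packets, as in the write pipeline */
#define SYNC_INTERVAL (1024 * 1024)       /* kick off writeback every 1 MB */
#define BLOCK_SIZE    (256 * 1024 * 1024) /* 256 MB HDFS block */

int main(void) {
    char buf[PACKET_SIZE];
    memset(buf, 'x', sizeof(buf));

    int fd = open("blockfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    off_t written = 0, last_sync = 0;
    while (written < BLOCK_SIZE) {
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }
        written += sizeof(buf);

        /* Every 1 MB, ask the kernel to start asynchronous writeback of the
         * just-written range instead of letting dirty pages pile up. */
        if (written - last_sync >= SYNC_INTERVAL) {
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    fsync(fd);  /* final durability point; most data is already in flight */
    close(fd);
    return 0;
}
```

Because writeback is already in flight by the time fsync runs, the final flush no longer hits a single disk with 256 MB at once.
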
  13. High IOPS: the Messages workload is random reads, with small preads (~4 KB) on the datanodes and two I/O operations for each pread (diagram: the datanode reads the checksum from the checksum file and the data from the block file).
  14. High IOPS: Inline Checksums (diagram: an HDFS block laid out as 4096-byte data chunks, each followed by a 4-byte checksum). With checksums inline with the data, a pread needs a single I/O operation (sketch below).
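
A minimal C sketch of why the inline layout halves the disk I/Os, assuming the 4096-byte chunk plus 4-byte checksum layout from slide 14; the file name, helper function, and offset math are illustrative, not the real HDFS on-disk format. One pread at the remapped offset returns both the data chunk and its checksum, where the separate checksum-file layout needed one read in each of two files.

```c
/* Sketch: read one 4 KB chunk plus its checksum from an "inline checksum"
 * block layout (4096-byte data chunk followed by a 4-byte CRC, per the slide).
 * File name and offset math are illustrative assumptions. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE    4096
#define CHECKSUM_SIZE 4
#define SLOT_SIZE     (CHUNK_SIZE + CHECKSUM_SIZE)

/* Read the chunk containing logical offset `off` with a single pread. */
static int read_chunk_inline(int fd, uint64_t off,
                             char data[CHUNK_SIZE], uint32_t *crc) {
    uint64_t chunk_idx   = off / CHUNK_SIZE;
    uint64_t file_offset = chunk_idx * SLOT_SIZE;  /* data + checksum interleaved */
    char slot[SLOT_SIZE];

    if (pread(fd, slot, SLOT_SIZE, (off_t)file_offset) != SLOT_SIZE)
        return -1;                                 /* one disk I/O in total */

    memcpy(data, slot, CHUNK_SIZE);
    memcpy(crc, slot + CHUNK_SIZE, CHECKSUM_SIZE);
    return 0;
}

int main(void) {
    int fd = open("blockfile.inline", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    char data[CHUNK_SIZE];
    uint32_t crc;
    if (read_chunk_inline(fd, 8192, data, &crc) == 0)
        printf("read chunk 2, stored crc=%08x\n", crc);
    close(fd);
    return 0;
}
```
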
  15. High IOPS: Results (charts): number of Puts and Gets above one second, Put average time, Get average time.
  16. HBase Locality: HDFS Favored Nodes. Each region's data is placed on 3 specific datanodes, locality is preserved on failure, and the favored nodes are persisted at the HBase layer (diagram: RegionServer with a local Datanode).
  17. HBase Locality: Solution. Persisting the information in the NameNode is complicated, so use the region directory instead (/*HBASE/<tablename>/<regionname>/cf1/…, /*HBASE/<tablename>/<regionname>/cf2/…): build a histogram of block locations under the directory and pick the lowest-frequency datanode when a replica has to be deleted (chart: block counts per datanode D1 to D4; sketch below).
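
A toy C sketch of the histogram idea from slide 17, under the assumption that "pick lowest frequency to delete" means dropping the excess replica held by the datanode that appears least often among the region's blocks; the datanode IDs and replica lists are made-up inputs, and a real implementation would walk the region directory and ask the NameNode for block locations.

```c
/* Sketch: choose which replica to delete so locality on favored nodes survives.
 * Inputs are toy data; this is an illustration of the histogram idea only. */
#include <stdio.h>

#define NUM_DATANODES 4
#define NUM_BLOCKS    3
#define REPLICATION   3

int main(void) {
    /* replicas[b][r] = datanode id holding replica r of block b (toy data) */
    int replicas[NUM_BLOCKS][REPLICATION] = {
        {0, 1, 2},
        {0, 1, 3},
        {0, 1, 2},
    };

    /* Histogram: how often each datanode holds a replica under this region. */
    int histogram[NUM_DATANODES] = {0};
    for (int b = 0; b < NUM_BLOCKS; b++)
        for (int r = 0; r < REPLICATION; r++)
            histogram[replicas[b][r]]++;

    /* The datanode with the lowest frequency matters least for locality,
     * so its replica is the one to drop when a block is over-replicated. */
    int victim = -1;
    for (int d = 0; d < NUM_DATANODES; d++) {
        if (histogram[d] == 0)
            continue;                       /* node holds nothing here */
        if (victim < 0 || histogram[d] < histogram[victim])
            victim = d;
    }
    printf("delete the excess replica on datanode D%d\n", victim + 1);
    return 0;
}
```
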
  18. More Improvements: keep file descriptors open, throttle re-replication.
  19. Improvements in the Write Path
  20. HBase WAL (diagram: the RegionServer writes through a pipeline of three Datanodes, each with an OS page cache and a disk): packets never hit disk, yet there are > 1 s outliers!
  21. Instrumentation of three stages: 1. write to the OS cache, 2. write to TCP buffers, 3. sync_file_range(SYNC_FILE_RANGE_WRITE). Stages 1 and 3 show outliers > 1 s!
  22. Use of strace
  23. Interesting Observations: write(2) outliers are correlated with a busy disk and are reproducible by artificially stressing the disk: dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000
  24. Test Program (diagram: a file written to the Linux filesystem with sync_file_range every 1 MB). Aligned 64 KB writes: no outliers! Unaligned 63 KB + 1 KB writes: outliers reproduced! (sketch below)
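
A condensed C sketch of a test program like the one on slide 24; the file name, total size, and the 100 ms reporting threshold are illustrative. Both modes write the same data and call sync_file_range every 1 MB; the aligned 64 KB mode stays fast, while the 63 KB + 1 KB mode keeps re-dirtying a page that may still be under writeback and reproduces the slow write(2) calls.

```c
/* Sketch of the slide's test program: write a file with sync_file_range every
 * 1 MB, either as aligned 64 KB chunks or as 63 KB + 1 KB chunks, and report
 * slow write(2) calls. Names, sizes, and threshold are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void timed_write(int fd, const char *buf, size_t len) {
    double t0 = now_sec();
    if (write(fd, buf, len) != (ssize_t)len) { perror("write"); exit(1); }
    double ms = (now_sec() - t0) * 1000.0;
    if (ms > 100.0)                         /* report outliers (threshold illustrative) */
        fprintf(stderr, "slow write: %zu bytes took %.0f ms\n", len, ms);
}

int main(int argc, char **argv) {
    int unaligned = (argc > 1 && strcmp(argv[1], "unaligned") == 0);
    static char buf[64 * 1024];
    memset(buf, 'x', sizeof(buf));

    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    off_t written = 0, last_sync = 0;
    for (int i = 0; i < 16 * 1024; i++) {          /* ~1 GB total */
        if (unaligned) {                           /* 63 KB + 1 KB: the 1 KB write can
                                                      hit a page still under writeback */
            timed_write(fd, buf, 63 * 1024);
            timed_write(fd, buf, 1 * 1024);
        } else {
            timed_write(fd, buf, 64 * 1024);       /* page-aligned 64 KB writes */
        }
        written += 64 * 1024;
        if (written - last_sync >= 1024 * 1024) {  /* start async writeback every 1 MB */
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    close(fd);
    return 0;
}
```

Stressing the same disk with the dd command from slide 23 while the unaligned mode runs should make the outliers easier to reproduce.
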
  25. Some Suspects: too many dirty pages, Linux stable pages. Kernel trace points revealed stable pages as the culprit.
  26. Stable Pages (diagram: an OS page written back to a persistent store, i.e. a device with integrity checking): if the page is modified during writeback, the kernel checksum no longer matches the device checksum, causing a checksum error. The kernel's solution is to lock pages while they are under writeback.
  27. Explanation of the Write Outliers (diagram): a 4 KB WAL write lands on an OS page that sync_file_range has already put under writeback, so the WAL write is blocked until the writeback completes.
  28. Solution? Patch: http://thread.gmane.org/gmane.comp.file-systems.ext4/35561
  29. sync_file_range? It is not asynchronous once more than 128 write requests are pending. Solution: issue the calls from a threadpool (sketch below).
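
A hedged C sketch of the threadpool workaround from slide 29; the real change was in the datanode code, and this single-worker, single-slot version only shows the shape of the fix. The write path hands the (offset, length) range to a background thread, so if sync_file_range blocks because too many write requests are already queued, only the worker stalls, not the WAL writer.

```c
/* Sketch: issue sync_file_range from a background worker so the writer thread
 * never blocks on it. Single-slot "queue" for brevity; a real threadpool would
 * use a bounded queue of ranges. Assumes Linux + pthreads. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct sync_req { int fd; off_t offset; off_t nbytes; bool pending; bool shutdown; };

static struct sync_req req;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *sync_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!req.pending && !req.shutdown)
            pthread_cond_wait(&cond, &lock);
        if (req.shutdown && !req.pending) { pthread_mutex_unlock(&lock); break; }
        struct sync_req r = req;
        req.pending = false;
        pthread_mutex_unlock(&lock);

        /* May block if the device's request queue is full; that is fine here,
         * because only this worker stalls, not the WAL writer. */
        sync_file_range(r.fd, r.offset, r.nbytes, SYNC_FILE_RANGE_WRITE);
    }
    return NULL;
}

/* Called from the write path: never blocks, just records the latest range. */
static void request_sync(int fd, off_t offset, off_t nbytes) {
    pthread_mutex_lock(&lock);
    req.fd = fd; req.offset = offset; req.nbytes = nbytes; req.pending = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, sync_worker, NULL);

    int fd = open("walfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    char buf[64 * 1024] = {0};
    for (int i = 0; i < 64; i++) {                 /* toy WAL writes */
        write(fd, buf, sizeof(buf));
        if ((i + 1) % 16 == 0)                     /* every 1 MB */
            request_sync(fd, (off_t)(i - 15) * (off_t)sizeof(buf),
                         (off_t)(16 * sizeof(buf)));
    }

    pthread_mutex_lock(&lock);
    req.shutdown = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(worker, NULL);
    close(fd);
    return 0;
}
```
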
  30. Results (chart): p99 write latency to the OS cache (in ms).
  31. Per-Request Profiling: the entire profile of a client request, the full profile of a pipeline write, the full profile of a pread. Lots of visibility!
  32. Interesting Profiles: in-memory operations taking > 1 s with no Java GC, correlated with a busy root disk and reproducible by stressing the root disk.
  33. Investigation: lsof showed /tmp/hsperfdata_hadoop/<pid> as suspicious; disabling it with -XX:-UsePerfData made the stalls disappear, but -XX:-UsePerfData breaks jps and jstack, so instead mount /tmp/hsperfdata_hadoop/ on tmpfs.
  34. Result (chart): p99 WAL write latency (in ms).
  35. Lessons Learnt: instrumentation is key, per-request profiling is very useful, and understanding the Linux kernel and filesystem is important.
  36. Acknowledgements: Hairong Kuang, Siying Dong, Kumar Sundararajan, Binu John, Dikang Gu, Paul Tuckfield, Arjen Roodselaar, Matthew Byng-Maddick, Liyin Tang
  37. FB Hadoop code: https://github.com/facebook/hadoop-20
  38. Questions?
  39. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
