Putting Wings on the Elephant
Putting Wings on the Elephant Presentation Transcript

  • 1. Putting Wings on the Elephant. Pritam Damania, Facebook, Inc.
  • 2. Putting wings on the Elephant! Pritam Damania, Software Engineer. April 2, 2014
  • 3. Agenda: 1. Background 2. Major Issues in the I/O Path 3. Read Improvements 4. Write Improvements 5. Lessons Learnt
  • 4. High-Level Messages Architecture. (Diagram: the application server writes each message to HBase and receives an ack back.)
  • 5. HBase Cluster Physical Layout ▪ Multiple clusters/cells for messaging ▪ 20 servers/rack; 5 or more racks per cluster. (Rack diagram: each rack runs a ZooKeeper peer plus one control service per rack: HDFS Namenode, Standby Namenode, Job Tracker, HBase Master, and Backup HBase Master; every server in a rack also runs a Region Server, Data Node, and Task Tracker.)
  • 6. Write Path Overview. A write goes to the RegionServer, which appends it to the HDFS write-ahead log and to the in-memory memstore; memstores are later flushed to HFiles in HDFS.
  • 7. HDFS Write Pipeline. The regionserver streams 64 KB packets through a pipeline of three datanodes; each datanode writes the packet into the OS page cache (later flushed to disk), and an ack flows back up the pipeline.
  • 8. Read Path Overview. A Get goes to the RegionServer, which serves it from the memstore and from HFiles in HDFS.
  • 9. Problems in R/W Path • Skewed disk usage • High disk iops • High p99 latency for reads and writes
  • 10. Improvements in Read Path
  • 11. Disk Skew • HDFS block size: 256 MB • An HDFS block resides on a single disk • So an fsync of up to 256 MB hits a single disk. (Diagram: three datanodes, each funneling the whole block through one disk's page cache.)
  • 12. Disk Skew: sync_file_range. As the block file is written to the Linux filesystem in 64 KB packets, issue sync_file_range(SYNC_FILE_RANGE_WRITE) every 1 MB to initiate asynchronous writeback, with a final fsync when the block is closed; dirty pages drain steadily instead of in one 256 MB burst. (A sketch follows.)
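
A minimal sketch of this write pattern in C, using a plain local file to stand in for the datanode's block file (illustrative only, not the actual HDFS datanode code):

    /* Write a 256 MB "block" in 64 KB packets; every 1 MB, hint the kernel
     * to start async writeback so dirty pages drain continuously instead of
     * in one 256 MB burst at fsync time. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PACKET   (64 * 1024)
    #define SYNC_WIN (1024 * 1024)

    int main(void) {
        int fd = open("blockfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char packet[PACKET];
        memset(packet, 'x', sizeof(packet));

        off_t written = 0, last_sync = 0;
        for (int i = 0; i < 4096; i++) {          /* 4096 x 64 KB = 256 MB */
            if (write(fd, packet, sizeof(packet)) != sizeof(packet)) {
                perror("write");
                return 1;
            }
            written += sizeof(packet);
            if (written - last_sync >= SYNC_WIN) {
                /* Initiate async writeback of the newly dirtied 1 MB. */
                sync_file_range(fd, last_sync, written - last_sync,
                                SYNC_FILE_RANGE_WRITE);
                last_sync = written;
            }
        }
        fsync(fd);   /* final durability point when the block is closed */
        close(fd);
        return 0;
    }
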
  • 13. High IOPS • The Messages workload is random-read • Small preads (~4 KB) on datanodes • Each pread costs two iops: one to read the checksum from the separate checksum file and one to read the data from the block file
  • 14. High IOPS: Inline Checksums. Store each 4-byte checksum inline with its 4096-byte data chunk in the HDFS block, so a pread costs a single iop. (See the sketch below.)
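
A sketch of the single-iop read path in C. The exact layout assumed here (each 4096-byte chunk immediately followed by its 4-byte CRC32, verified with zlib) is for illustration rather than Hadoop's actual on-disk format:

    /* Inline-checksum layout: [4096 B data][4 B CRC32] repeated. One pread
     * fetches data and checksum together, so a random read costs one iop
     * instead of two (block file + separate checksum file). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <zlib.h>                 /* crc32(); link with -lz */

    #define CHUNK 4096
    #define SLOT  (CHUNK + 4)         /* data chunk + inline checksum */

    /* Read logical chunk n from an inline-checksum block file into out. */
    ssize_t read_chunk(int fd, long n, char out[CHUNK]) {
        char buf[SLOT];
        if (pread(fd, buf, SLOT, (off_t)n * SLOT) != SLOT)   /* one iop */
            return -1;

        uint32_t stored;
        memcpy(&stored, buf + CHUNK, sizeof(stored));
        if (stored != (uint32_t)crc32(0L, (const Bytef *)buf, CHUNK))
            return -1;                /* checksum mismatch: corrupt chunk */

        memcpy(out, buf, CHUNK);
        return CHUNK;
    }
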
  • 15. High IOPS: Results. (Charts: number of puts and gets above one second; average put time; average get time.)
  • 16. HBase Locality: HDFS Favored Nodes ▪ Each region's data is placed on 3 specific datanodes ▪ Locality is preserved on failure ▪ Favored nodes are persisted at the HBase layer. (Diagram: a RegionServer writing to its local datanode.)
  • 17. HBase Locality: Solution • Persisting the info in the NameNode is complicated • Region directory: ▪ /*HBASE/<tablename>/<regionname>/cf1/… ▪ /*HBASE/<tablename>/<regionname>/cf2/… • Build a histogram of block locations in the directory • Pick the lowest-frequency datanode's replicas to delete. (Chart: block counts per datanode D1 through D4; a sketch of the heuristic follows.)
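
An illustrative sketch of the histogram heuristic with hypothetical datanode names and block locations; the real implementation would walk the block locations of the files under the region directory through the HDFS client:

    /* Count how often each datanode appears among a region's block replicas;
     * the three favored nodes dominate, so the lowest-frequency node holds
     * the misplaced replicas that should be deleted. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 16

    int main(void) {
        /* Hypothetical replica locations for the blocks of one region. */
        const char *locations[] = {
            "D1", "D2", "D3",   "D1", "D2", "D3",
            "D1", "D2", "D3",   "D1", "D2", "D4",  /* D4: stray replica */
        };
        const char *nodes[MAX_NODES];
        int counts[MAX_NODES] = {0};
        int n_nodes = 0;

        /* Build the histogram of datanode occurrences. */
        for (size_t i = 0; i < sizeof locations / sizeof *locations; i++) {
            int j;
            for (j = 0; j < n_nodes; j++)
                if (strcmp(nodes[j], locations[i]) == 0)
                    break;
            if (j == n_nodes)
                nodes[n_nodes++] = locations[i];
            counts[j]++;
        }

        /* Pick the lowest-frequency node as the one to clean up. */
        int min = 0;
        for (int j = 1; j < n_nodes; j++)
            if (counts[j] < counts[min])
                min = j;
        printf("delete replicas on %s (%d blocks)\n", nodes[min], counts[min]);
        return 0;
    }
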
  • 18. More Improvements • Keep file descriptors open • Throttle re-replication
  • 19. Improvements in Write Path
  • 20. HBase WAL • WAL packets go from the regionserver through the three-datanode pipeline into the OS page cache and never hit disk synchronously • Yet >1s outliers were seen!
  • 21. Instrumentation. Timed each step: 1. write to OS cache 2. write to TCP buffers 3. sync_file_range(SYNC_FILE_RANGE_WRITE). Steps 1 and 3 showed >1s outliers!
  • 22. Use of strace
  • 23. Interesting Observations • write(2) outliers correlated with a busy disk • Reproducible by artificially stressing the disk with:
    dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000
  • 24. Test Program. Write a file with sync_file_range every 1 MB, varying the write size: page-aligned 64 KB writes produce no outliers, while 63 KB + 1 KB writes reproduce the outliers! (Each 63 KB write ends mid-page, so the following 1 KB write re-dirties a page that may already be under writeback; a sketch of such a test follows.)
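
A sketch of such a test program, assuming per-write timing with clock_gettime and the worst write(2) latency as the signal; run it while the disk is stressed (for example with the dd command above):

    /* Compare page-aligned 64 KB writes against 63 KB + 1 KB writes, both
     * with sync_file_range every 1 MB. The 1 KB write re-dirties the page
     * the preceding 63 KB write ended on; if that page is under writeback,
     * stable pages make the write block, producing multi-second outliers. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Write total_mb MB using the repeating chunk pattern; return the
     * slowest individual write(2) in seconds. */
    static double run(const char *path, const size_t *chunks, int nchunks,
                      long total_mb) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); exit(1); }
        static char buf[64 * 1024];
        memset(buf, 'x', sizeof(buf));

        off_t written = 0, last_sync = 0;
        double worst = 0;
        for (int i = 0; written < (off_t)total_mb << 20; i++) {
            size_t n = chunks[i % nchunks];
            double t0 = now();
            if (write(fd, buf, n) != (ssize_t)n) { perror("write"); exit(1); }
            double dt = now() - t0;
            if (dt > worst)
                worst = dt;
            written += n;
            if (written - last_sync >= 1 << 20) {
                sync_file_range(fd, last_sync, written - last_sync,
                                SYNC_FILE_RANGE_WRITE);
                last_sync = written;
            }
        }
        close(fd);
        return worst;
    }

    int main(void) {
        const size_t aligned[]   = { 64 * 1024 };         /* page-aligned */
        const size_t unaligned[] = { 63 * 1024, 1024 };   /* re-dirties   */
        printf("64k    worst write: %.3fs\n", run("t64", aligned, 1, 256));
        printf("63k/1k worst write: %.3fs\n", run("t63", unaligned, 2, 256));
        return 0;
    }
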
  • 25. Some Suspects • Too many dirty pages • Linux stable pages • Kernel trace points revealed stable pages as the culprit
  • 26. Stable Pages. On devices with integrity checking, the kernel computes a checksum when it starts writing a page back; if the page is modified while the writeback is in flight, the checksum the device verifies no longer matches and it reports a checksum error. The kernel's solution: lock pages under writeback, so any write to such a page blocks until the writeback completes.
  • 27. Explanation of Write Outliers. A WAL write lands on a 4 KB OS page that sync_file_range has already put under writeback to the persistent store; with stable pages, the WAL write blocks until the writeback finishes.
  • 28. Solution? Patch: http://thread.gmane.org/gmane.comp.file-systems.ext4/35561
  • 29. sync_file_range? • sync_file_range is not async once more than 128 write requests are queued; it blocks the caller • Solution: issue it from a threadpool (sketch below)
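
A sketch of the workaround, with a detached thread per call for brevity; async_sync_file_range and sync_req are made-up names, and a real version would use a fixed pool feeding a queue, as the slide's threadpool suggests:

    /* Hand sync_file_range off to a background thread so the write path
     * never stalls, even when the request queue is deep enough (>128)
     * that sync_file_range itself would block. Link with -lpthread. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>

    struct sync_req { int fd; off_t off; off_t len; };

    static void *sync_worker(void *arg) {
        struct sync_req *r = arg;
        /* Only this worker can stall here; the writer keeps going. */
        sync_file_range(r->fd, r->off, r->len, SYNC_FILE_RANGE_WRITE);
        free(r);
        return NULL;
    }

    /* Drop-in for a direct sync_file_range call on the write path. */
    void async_sync_file_range(int fd, off_t off, off_t len) {
        struct sync_req *r = malloc(sizeof *r);
        r->fd = fd; r->off = off; r->len = len;
        pthread_t t;
        pthread_create(&t, NULL, sync_worker, r);
        pthread_detach(t);
    }
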
  • 30. Results. (Chart: p99 write latency to OS cache, in ms.)
  • 31. Per-Request Profiling • Entire profile of client requests • Full profile of pipeline writes • Full profile of preads • A lot of visibility!
  • 32. Interesting Profiles • In-memory operations taking >1s • No Java GC • Correlated with a busy root disk • Reproducible by stressing the root disk
  • 33. Investigation • Used lsof • /tmp/hsperfdata_hadoop/<pid> looked suspicious • Disabled it with -XX:-UsePerfData • The stalls disappeared! • But -XX:-UsePerfData breaks jps and jstack • Instead, mount /tmp/hsperfdata_hadoop/ on tmpfs
  • 34. Result. (Chart: p99 WAL write latency, in ms.)
  • 35. Lessons Learnt • Instrumentation is key • Per-request profiling is very useful • An understanding of the Linux kernel and filesystem is important
  • 36. Acknowledgements ▪ Hairong Kuang ▪ Siying Dong ▪ Kumar Sundararajan ▪ Binu John ▪ Dikang Gu ▪ Paul Tuckfield ▪ Arjen Roodselaar ▪ Matthew Byng-Maddick ▪ Liyin Tang
  • 37. FB Hadoop code • https://github.com/facebook/hadoop-20
  • 38. Questions?
  • 39. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0