Putting Wings on the Elephant
Pritam Damania
Facebook, Inc.
Putting wings on the Elephant!
Pritam Damania
Software Engineer
April 2, 2014
Agenda
1. Background
2. Major Issues in I/O path
3. Read Improvements
4. Write Improvements
5. Lessons learnt
High-Level Messages Architecture
[Diagram: an Application Server writes Messages to HBase and receives an Ack back]
HBase Cluster Physical Layout
▪ Multiple clusters/cells for messaging
▪ 20 servers/rack; 5 or more racks per cluster
[Diagram: each rack runs a ZooKeeper peer plus one control service (Rack #1: HDFS Namenode, Rack #2: Standby Namenode, Rack #3: Job Tracker, Rack #4: HBase Master, Rack #5: Backup HBase Master); the remaining 19 servers in each rack run a Region Server, Data Node, and Task Tracker]
Write Path Overview
[Diagram: the RegionServer appends to a Write-Ahead Log and writes to its Memstore; the Memstore flushes to HFiles; both the WAL and the HFiles live in HDFS]
HDFS Write Pipeline
[Diagram: the RegionServer streams 64k packets through a pipeline of three Datanodes; each Datanode writes into its OS page cache before the data reaches disk, and an Ack flows back up the pipeline]
Read Path Overview
[Diagram: a Get request arrives at the RegionServer and is served from the Memstore and from HFiles read out of HDFS]
Problems in R/W Path
• Skewed disk usage
• High disk IOPS
• High p99 for reads/writes
Improvements in Read Path
Disk Skew
[Diagram: three Datanodes, each writing through its OS page cache to a single disk]
• HDFS block size: 256MB
• An HDFS block resides on a single disk
• An fsync of 256MB therefore hits a single disk
Disk Skew - Sync File Range
[Diagram: a block file written to the Linux filesystem as a stream of 64k writes, with sync_file_range issued every 1MB and an fsync at the end]
▪ sync_file_range(SYNC_FILE_RANGE_WRITE)
▪ Initiates an async write
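A minimal sketch of this pattern in C, assuming one 256MB block written as 64k chunks; the file name and error handling are illustrative, not the datanode's actual code:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/blockfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[64 * 1024];
        memset(buf, 'x', sizeof(buf));
        off_t off = 0, synced = 0;       /* synced = start of range not yet handed to writeback */
        while (off < 256L * 1024 * 1024) {               /* one 256MB HDFS block */
            write(fd, buf, sizeof(buf));                 /* 64k writes land in the OS page cache */
            off += sizeof(buf);
            if (off - synced >= 1024 * 1024) {           /* every 1MB, kick off async writeback */
                sync_file_range(fd, synced, off - synced, SYNC_FILE_RANGE_WRITE);
                synced = off;
            }
        }
        fsync(fd);    /* the final fsync now finds most of the block already on disk,
                         instead of flushing 256MB to a single disk in one burst */
        close(fd);
        return 0;
    }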
High IOPS
• Messages workload is random read
• Small preads (~4KB) on datanodes
• Two iops for each pread
[Diagram: each pread on the Datanode reads the checksum from a separate checksum file and then the data from the block file]
High IOPS - Inline Checksums
[Diagram: an HDFS block laid out as repeating 4096-byte data chunks, each followed by its 4-byte checksum]
• Checksums inline with data
• Single iop for pread
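A sketch of how a single pread can fetch data and checksum together under this layout; the 4100-byte on-disk chunk size and the offset math are assumptions based on the slide, not the actual datanode code:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK_DATA 4096
    #define CHUNK_CRC  4
    #define CHUNK_DISK (CHUNK_DATA + CHUNK_CRC)   /* 4100 bytes on disk per chunk */

    /* Fetch one 4KB chunk and its inline checksum with a single pread;
       with a separate checksum file this would have cost a second pread
       (and a second disk iop). */
    static ssize_t read_chunk(int fd, uint64_t chunk_idx, char *data, uint32_t *crc) {
        char buf[CHUNK_DISK];
        off_t disk_off = (off_t)chunk_idx * CHUNK_DISK;    /* data and checksum are adjacent */
        if (pread(fd, buf, sizeof(buf), disk_off) != (ssize_t)sizeof(buf))
            return -1;
        memcpy(data, buf, CHUNK_DATA);
        memcpy(crc, buf + CHUNK_DATA, CHUNK_CRC);          /* caller verifies crc over data */
        return CHUNK_DATA;
    }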
High IOPS - Results
[Charts: number of Puts and Gets taking more than one second; average Put time; average Get time]
HBase Locality - HDFS Favored Nodes
▪ Each region’s data on 3 specific datanodes
▪ On failure, locality is preserved
▪ Favored nodes persisted at the HBase layer
[Diagram: a RegionServer co-located with its local Datanode]
HBase Locality - Solution
• Persisting the info in the NameNode is complicated
• Region directory:
▪ /*HBASE/<tablename>/<regionname>/cf1/…
▪ /*HBASE/<tablename>/<regionname>/cf2/…
• Build a histogram of block locations in the directory
• Pick the lowest-frequency datanode to delete from
[Chart: histogram of block counts (0-10000) across datanodes D1-D4]
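A sketch of the histogram idea, as I read the slide: count how many blocks under the region directory each datanode hosts and drop the excess replica from the least-used node, so locality on the favored nodes is preserved. The node-ID representation and the 64-node cap are illustrative, not the real HDFS/HBase code:

    /* block_nodes[i] = ID of the datanode hosting block replica i of the region. */
    static int least_used_node(const int *block_nodes, int n_blocks) {
        int counts[64] = {0};                 /* histogram of replicas per node */
        for (int i = 0; i < n_blocks; i++)
            counts[block_nodes[i]]++;
        int best = block_nodes[0];
        for (int i = 0; i < n_blocks; i++)    /* node with the lowest frequency */
            if (counts[block_nodes[i]] < counts[best])
                best = block_nodes[i];
        return best;                          /* candidate to delete a replica from */
    }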
More Improvements
• Keep fds open
• Throttle re-replication
Improvements in Write Path
HBase WAL
[Diagram: the RegionServer writes WAL packets to three Datanodes, each of which buffers them in its OS page cache before disk]
• Packets never hit disk
• >1s outliers!
Instrumentation
1. Write to OS cache
2. Write to TCP buffers
3. sync_file_range(SYNC_FILE_RANGE_WRITE)
1. & 3. showed outliers >1s!
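A sketch of the kind of timing instrumentation used here, in C; the one-second threshold matches the outliers on the slide, while the wrapper name and logging format are illustrative:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Wrap write(2) and log any call that takes longer than one second. */
    static ssize_t timed_write(int fd, const void *buf, size_t len) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t n = write(fd, buf, len);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (ms > 1000.0)                      /* the >1s outliers seen on the WAL path */
            fprintf(stderr, "write of %zu bytes took %.0f ms\n", len, ms);
        return n;
    }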
Use of strace
Interesting Observations
• write(2) outliers correlated with busy disk
• Reproducible by artificially stressing the disk:
dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000
Test Program
[Diagram: a file written to the Linux filesystem in two patterns, with sync_file_range issued every 1MB in both]
• 64k-aligned writes: no outliers!
• Unaligned 63k + 1k writes: outliers reproduced!
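A sketch of how the unaligned 63k + 1k pattern can trip over stable pages; the slide does not show the original test program, so the exact point at which the 1MB threshold is checked (here, after the 63k write) and the file name are assumptions:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/mnt/d7/test/unaligned", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char a[63 * 1024], b[1024];
        memset(a, 'a', sizeof(a));
        memset(b, 'b', sizeof(b));
        off_t off = 0, synced = 0;
        for (int i = 0; i < 4096; i++) {
            write(fd, a, sizeof(a));
            off += sizeof(a);
            if (off - synced >= 1024 * 1024) {
                /* the 63k write ends mid-page, so this writeback range ends in a
                   partially written page (64k writes always end on a page boundary,
                   which is why the aligned case shows no outliers) */
                sync_file_range(fd, synced, off - synced, SYNC_FILE_RANGE_WRITE);
                synced = off;
            }
            /* this 1k append lands on a page the kernel may still be writing back;
               stable pages make it wait, reproducing the >1s outliers */
            write(fd, b, sizeof(b));
            off += sizeof(b);
        }
        close(fd);
        return 0;
    }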
Some suspects
• Too many dirty pages
• Linux stable pages
• Kernel trace points revealed stable pages as the culprit
Stable Pages
[Diagram: during writeback, the kernel checksums an OS page and the device (with integrity checking) verifies its own checksum of the same page]
• If the page is modified while under writeback, the checksums mismatch: a checksum error
• Solution - lock pages while they are under writeback
Explanation of Write Outliers
[Diagram: a 4k OS page is under writeback to the persistent store via sync_file_range; a WAL write that touches the same page is blocked until the writeback completes]
Solution?
Patch: http://thread.gmane.org/gmane.comp.file-systems.ext4/35561
sync_file_range?
• sync_file_range is not async once there are more than 128 in-flight write requests
• Solution - use a threadpool
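One way to keep the writer from blocking is to hand the sync_file_range call to another thread. The sketch below uses a detached helper thread per call for brevity; the actual fix used a threadpool, so treat the structure and names here as illustrative:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>

    struct sync_req { int fd; off_t off; off_t len; };

    static void *do_sync(void *arg) {
        struct sync_req *r = arg;
        /* may block if the disk's request queue is congested, but only this
           helper thread stalls; the thread appending to the WAL keeps going */
        sync_file_range(r->fd, r->off, r->len, SYNC_FILE_RANGE_WRITE);
        free(r);
        return NULL;
    }

    static void sync_in_background(int fd, off_t off, off_t len) {
        struct sync_req *r = malloc(sizeof(*r));
        r->fd = fd; r->off = off; r->len = len;
        pthread_t t;
        pthread_create(&t, NULL, do_sync, r);
        pthread_detach(t);
    }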
Results
[Chart: p99 write latency to the OS cache, in ms]
Per request profiling
• Entire profile of client requests
• Full profile of pipeline writes
• Full profile of preads
• Lots of visibility!
Interesting Profiles
• In-memory operations >1s
• No Java GC
• Correlated with a busy root disk
• Reproducible by stressing the root disk
Investigation
• Use lsof
• /tmp/hsperfdata_hadoop/<pid> looked suspicious
• Disable it with -XX:-UsePerfData
• Stalls disappeared!
• But -XX:-UsePerfData breaks jps and jstack
• Mount /tmp/hsperfdata_hadoop/ on tmpfs instead
Result
[Chart: p99 WAL write latency (in ms)]
Lessons learnt
• Instrumentation is key
• Per-request profiling is very useful
• Understanding the Linux kernel and filesystem is important
Acknowledgements
▪ Hairong Kuang
▪ Siying Dong
▪ Kumar Sundararajan
▪ Binu John
▪ Dikang Gu
▪ Paul Tuckfield
▪ Arjen Roodselaar
▪ Matthew Byng-Maddick
▪ Liyin Tang
FB Hadoop code
• https://github.com/facebook/hadoop-20
Questions ?
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0