Putting Wings on the Elephant
 

Presentation Transcript

    • Putting Wings on the Elephant - Pritam Damania, Facebook, Inc.
    • Putting wings on the Elephant! Pritam Damania, Software Engineer, April 2, 2014
    • Agenda 1. Background 2. Major issues in the I/O path 3. Read improvements 4. Write improvements 5. Lessons learnt
    • High-Level Messages Architecture [diagram: the Application Server writes messages to HBase and receives acks]
    • HBase Cluster Physical Layout ▪ Multiple clusters/cells for messaging ▪ 20 servers per rack; 5 or more racks per cluster [diagram: each rack hosts a ZooKeeper peer plus one control service (HDFS Namenode, Standby Namenode, Job Tracker, HBase Master, or Backup HBase Master); the remaining 19 servers in the rack each run a Region Server, Data Node, and Task Tracker]
    • Write Path Overview [diagram: the RegionServer appends to the Write-Ahead Log on HDFS and updates its Memstore; Memstores are flushed to HFiles on HDFS]
    • HDFS Write Pipeline [diagram: the RegionServer streams 64k packets through a pipeline of three Datanodes; each Datanode lands the data in its OS page cache in front of the disk, and an ack flows back]
    • Read Path Overview [diagram: a Get arrives at the RegionServer and is served from the Memstore and from HFiles on HDFS]
    • Problems in R/W Path • Skewed Disk Usage • High Disk IOPS • High p99 for reads/writes
    • Improvements in Read Path
    • Disk Skew [diagram: Datanodes, each with an OS page cache in front of a disk] • HDFS block size: 256MB • An HDFS block resides on a single disk • Fsync of 256MB hits a single disk
    • Disk Skew - Sync File Range [diagram: a block file written on the Linux filesystem as 64k packets, with sync_file_range issued every 1MB and an fsync at the end] ▪ sync_file_range(SYNC_FILE_RANGE_WRITE) ▪ Initiates an async write
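
      A minimal C sketch of the pattern this slide describes, assuming a plain local file named "blockfile" and the 64k packet / 1MB flush sizes from the slide; it is not the actual datanode code, just the sync_file_range idea.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        #define PACKET_SIZE (64 * 1024)          /* one pipeline packet        */
        #define SYNC_EVERY  (1024 * 1024)        /* start writeback every 1MB  */
        #define BLOCK_SIZE  (256 * 1024 * 1024L) /* one HDFS block             */

        int main(void) {
            char packet[PACKET_SIZE];
            memset(packet, 0, sizeof(packet));

            int fd = open("blockfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) return 1;

            long written = 0, last_sync = 0;
            while (written < BLOCK_SIZE) {
                if (write(fd, packet, PACKET_SIZE) != PACKET_SIZE) return 1;
                written += PACKET_SIZE;
                if (written - last_sync >= SYNC_EVERY) {
                    /* Ask the kernel to start writing back the last 1MB now,
                     * without waiting, so the final fsync has little left to
                     * push to a single disk all at once. */
                    sync_file_range(fd, last_sync, written - last_sync,
                                    SYNC_FILE_RANGE_WRITE);
                    last_sync = written;
                }
            }
            fsync(fd);   /* durability point; most data is already on its way */
            close(fd);
            return 0;
        }
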
    • High IOPS • Messages workload is random read • Small preads (~4KB) on Datanodes • Two IOPS for each pread [diagram: the Datanode reads the checksum from a separate checksum file and then reads the data from the block file]
    • High IOPS - Inline Checksums [diagram: an HDFS block laid out as 4096-byte data chunks, each followed by its 4-byte checksum] • Checksums stored inline with the data • Single IOP for a pread
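
      A hedged sketch of how an inline-checksum layout turns a random read into a single iop: each 4096-byte chunk is immediately followed by its 4-byte checksum, so one pread fetches both. The file name, helper, and offset math below are illustrative assumptions, not the real HDFS block format code.

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        #define CHUNK_SIZE 4096                       /* 4096-byte data chunk */
        #define CSUM_SIZE  4                          /* 4-byte checksum      */
        #define SLOT_SIZE  (CHUNK_SIZE + CSUM_SIZE)   /* chunk + its checksum */

        /* Read chunk `idx` and its checksum with a single pread (one iop),
         * because the checksum sits right after the data it covers. */
        static int read_chunk(int fd, long idx, unsigned char *data, uint32_t *csum) {
            unsigned char slot[SLOT_SIZE];
            off_t offset = (off_t)idx * SLOT_SIZE;
            if (pread(fd, slot, SLOT_SIZE, offset) != SLOT_SIZE)
                return -1;
            memcpy(data, slot, CHUNK_SIZE);
            memcpy(csum, slot + CHUNK_SIZE, CSUM_SIZE);
            return 0;
        }

        int main(void) {
            int fd = open("inline_checksum_blockfile", O_RDONLY);  /* hypothetical file */
            if (fd < 0) return 1;
            unsigned char data[CHUNK_SIZE];
            uint32_t csum;
            if (read_chunk(fd, 10, data, &csum) == 0)
                printf("chunk 10 checksum: %u\n", csum);
            close(fd);
            return 0;
        }
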
    • High IOPS - Results [charts: number of puts and gets above one second; average put time; average get time]
    • HBase Locality - HDFS Favored Nodes ▪ Each region's data is placed on 3 specific Datanodes ▪ Locality is preserved on failure ▪ Favored nodes are persisted at the HBase layer [diagram: RegionServer co-located with its local Datanode]
    • HBase Locality - Solution • Persisting the info in the NameNode is complicated • Region directory: ▪ /*HBASE/<tablename>/<regionname>/cf1/… ▪ /*HBASE/<tablename>/<regionname>/cf2/… • Build a histogram of block locations in the directory • Pick the lowest-frequency location to delete [chart: block counts per Datanode D1-D4]
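
      A small illustrative sketch (hypothetical data, not the HBase code) of the histogram idea on this slide: count how often each Datanode appears among a region directory's block locations and pick the least-frequent one as the replica to delete.

        #include <stdio.h>
        #include <string.h>

        #define MAX_NODES 64

        struct histogram {
            const char *node[MAX_NODES];
            int count[MAX_NODES];
            int n;
        };

        /* Count one more block replica seen on Datanode `dn`. */
        static void record_location(struct histogram *h, const char *dn) {
            for (int i = 0; i < h->n; i++)
                if (strcmp(h->node[i], dn) == 0) { h->count[i]++; return; }
            h->node[h->n] = dn;
            h->count[h->n++] = 1;
        }

        /* The Datanode holding the fewest of the region's blocks: the best
         * candidate replica to delete when a block is over-replicated. */
        static const char *least_frequent(const struct histogram *h) {
            int best = 0;
            for (int i = 1; i < h->n; i++)
                if (h->count[i] < h->count[best]) best = i;
            return h->node[best];
        }

        int main(void) {
            /* Hypothetical replica locations of the blocks under one
             * region directory (<tablename>/<regionname>/...). */
            const char *locations[] = { "d1", "d2", "d3", "d1", "d2", "d3", "d1", "d4" };
            struct histogram h = { 0 };
            for (unsigned i = 0; i < sizeof(locations) / sizeof(locations[0]); i++)
                record_location(&h, locations[i]);
            printf("delete the replica on %s\n", least_frequent(&h));
            return 0;
        }
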
    • More Improvements • Keep fds open • Throttle re-replication
    • Improvements in Write Path
    • HBase WAL [diagram: the RegionServer writes through a pipeline of three Datanodes; data stays in each OS page cache rather than going to disk] • Packets never hit disk • >1s outliers!
    • Instrumentation 1. Write to OS cache 2. Write to TCP buffers 3. sync_file_range(SYNC_FILE_RANGE_WRITE) Steps 1 and 3 show outliers >1s!
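
      As a rough illustration of this kind of instrumentation (not the actual Hadoop change), the sketch below times a step in the write path and logs it when it exceeds one second; the threshold and the simulated slow step are assumptions.

        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        static double elapsed_ms(struct timespec a, struct timespec b) {
            return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
        }

        /* Run one step of the write path and report it if it took >1s. */
        #define TIMED_STEP(name, call) do {                                    \
                struct timespec t0, t1;                                        \
                clock_gettime(CLOCK_MONOTONIC, &t0);                           \
                (void)(call);                                                  \
                clock_gettime(CLOCK_MONOTONIC, &t1);                           \
                double ms = elapsed_ms(t0, t1);                                \
                if (ms > 1000.0)                                               \
                    fprintf(stderr, "outlier: %s took %.0f ms\n", name, ms);   \
            } while (0)

        int main(void) {
            /* sleep(2) stands in for a write or sync_file_range call that stalls. */
            TIMED_STEP("sync_file_range", sleep(2));
            return 0;
        }
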
    • Use of strace
    • Interesting Observations • write(2) outliers correlated with a busy disk • Reproducible by artificially stressing the disk: dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000
    • Test Program [diagram: a file written on the Linux filesystem with sync_file_range every 1MB] ▪ Aligned 64k writes: no outliers! ▪ 63k + 1k writes: outliers reproduced!
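
      A reconstruction of what such a test program could look like, based only on this slide: write a 256MB file with sync_file_range every 1MB, either in aligned 64k chunks or as 63k+1k pairs. In the split mode the 1k write re-dirties the tail page that the previous sync_file_range may have put under writeback, which is where the stalls show up. File name and sizes are illustrative.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>

        #define MB (1024 * 1024L)

        /* Append `n` bytes and issue sync_file_range once ~1MB has accumulated. */
        static void append(int fd, const char *buf, long n, long *written, long *last_sync) {
            if (write(fd, buf, (size_t)n) != n)
                _exit(1);
            *written += n;
            if (*written - *last_sync >= MB) {
                sync_file_range(fd, *last_sync, *written - *last_sync,
                                SYNC_FILE_RANGE_WRITE);
                *last_sync = *written;
            }
        }

        int main(int argc, char **argv) {
            char buf[64 * 1024];
            memset(buf, 0, sizeof(buf));
            int split = argc > 1 && strcmp(argv[1], "split") == 0;

            int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) return 1;
            long written = 0, last_sync = 0;
            for (long total = 0; total < 256 * MB; total += 64 * 1024) {
                if (split) {
                    /* 63k then 1k: the 1k write touches the tail page that the
                     * previous sync_file_range may have put under writeback. */
                    append(fd, buf, 63 * 1024, &written, &last_sync);
                    append(fd, buf, 1 * 1024, &written, &last_sync);
                } else {
                    /* Aligned 64k writes always start on a fresh 4k page. */
                    append(fd, buf, 64 * 1024, &written, &last_sync);
                }
            }
            close(fd);
            return 0;
        }
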
    • Some suspects • Too many dirty pages • Linux stable pages • Kernel trace points revealed stable pages as the culprit
    • Stable Pages [diagram: writeback of an OS page to a persistent store (a device with integrity checking); the kernel-computed checksum must match the device checksum] • Modifying a page during writeback causes a checksum error • Solution: lock pages that are under writeback
    • Explanation of Write Outliers [diagram: a 4k OS page is under writeback to the persistent store via sync_file_range; a WAL write to that same page is blocked until writeback completes]
    • Solution? Patch: http://thread.gmane.org/gmane.comp.file-systems.ext4/35561
    • sync_file_range? • sync_file_range is not async when there are more than 128 outstanding write requests • Solution: use a threadpool
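
      A rough sketch of the threadpool workaround, assuming a single background worker and a one-slot mailbox just to keep it short; the actual change lives in the Hadoop/HBase code, not in C. Because sync_file_range can block once the disk has too many outstanding write requests, only the worker thread ever waits on it, and the WAL writer returns immediately.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <pthread.h>
        #include <stdbool.h>
        #include <sys/types.h>
        #include <unistd.h>

        struct sync_req { int fd; off_t off; off_t len; bool pending; };

        static struct sync_req req;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

        /* Background worker: the only thread that ever waits on a busy disk. */
        static void *sync_worker(void *arg) {
            (void)arg;
            for (;;) {
                pthread_mutex_lock(&lock);
                while (!req.pending)
                    pthread_cond_wait(&cv, &lock);
                struct sync_req r = req;
                req.pending = false;
                pthread_mutex_unlock(&lock);
                sync_file_range(r.fd, r.off, r.len, SYNC_FILE_RANGE_WRITE);
            }
            return NULL;
        }

        /* Called from the WAL writer thread: hands the range off and returns. */
        static void async_sync_file_range(int fd, off_t off, off_t len) {
            pthread_mutex_lock(&lock);
            req = (struct sync_req){ fd, off, len, true };
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&lock);
        }

        int main(void) {
            pthread_t worker;
            pthread_create(&worker, NULL, sync_worker, NULL);

            int fd = open("walfile", O_WRONLY | O_CREAT, 0644);  /* hypothetical WAL file */
            if (fd < 0) return 1;
            async_sync_file_range(fd, 0, 1024 * 1024);  /* returns immediately */

            sleep(1);   /* give the worker a moment before the process exits */
            close(fd);
            return 0;
        }
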
    • Results [chart: p99 write latency to OS cache, in ms]
    • Per-request profiling • Entire profile of client requests • Full profile of a pipeline write • Full profile of a pread • Lots of visibility!
    • Interesting Profiles • In-memory operations >1s • No Java GC • Correlated with a busy root disk • Reproducible by stressing the root disk
    • Investigation • Use lsof • /tmp/hsperfdata_hadoop/<pid> looks suspicious • Disable it using -XX:-UsePerfData • Stalls disappeared! • -XX:-UsePerfData breaks jps and jstack • Mount /tmp/hsperfdata_hadoop/ on tmpfs instead
    • Result [chart: p99 WAL write latency, in ms]
    • Lessons learnt • Instrumentation is key • Per-request profiling is very useful • Understanding the Linux kernel and filesystem is important
    • Acknowledgements ▪ Hairong Kuang ▪ Siying Dong ▪ Kumar Sundararajan ▪ Binu John ▪ Dikang Gu ▪ Paul Tuckfield ▪ Arjen Roodselaar ▪ Matthew Byng-Maddick ▪ Liyin Tang
    • FB Hadoop code • https://github.com/facebook/hadoop-20
    • Questions?
    • (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0