Heavy users monopolizing cluster resources are a frequent cause of slowdowns for everyone else. With only one namenode and thousands of datanodes, any poorly written application is a potential distributed denial-of-service attack on the namenode. In this talk, you will learn how to prevent slowdowns caused by heavy users and poorly written applications by enabling IPC Quality of Service (QoS), a new feature in Hadoop 2.6+. On Twitter's and eBay's production clusters, we've seen response times of 500 milliseconds with QoS off drop to 10 milliseconds with QoS on during heavy usage. We'll cover how IPC QoS works and share our experience tuning it.
4. Hadoop Workloads @ Twitter and eBay
• Large scale
  • Thousands of machines
  • Tens of thousands of jobs / day
• Diverse
  • Production vs. ad-hoc
  • Batch vs. interactive vs. iterative
• Require performance isolation
5. Solutions for Performance Isolation
• YARN: flexible cluster resource management
• Cross-data-center traffic QoS
  • Set QoS policy via DSCP bits in the IP header
• HDFS Federation
• Cluster separation: run high-SLA jobs in another cluster
17. Hadoop RPC Overview
[Architecture diagram: in the client process, the DFS Client issues calls through an RPC Client to the RPC Server in the namenode process; Readers enqueue incoming calls into a FIFO call queue, Handlers dequeue them and execute against the Namenode Service under the NN lock, and Responders send results back.]
47. Current Status
• Enabled on all Twitter and eBay production clusters for 6+ months
• Open source availability: HADOOP-9640
  • Swappable call queue in 2.4
  • FairCallQueue in 2.6
  • RPC backoff in 2.8
49. QoS is Easy to Enable
hdfs-site.xml:

<property>
  <!-- 8020 is the RPC port you want QoS on -->
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>
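Once these are in place, the call queue can also be swapped at runtime while you tune: the hadoop dfsadmin -refreshCallQueue command (covered under Thoughts on Tuning below) rebuilds the queue without a namenode restart.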
50. Future Possibilities
• RPC scheduling improvements
  • Weighted share per user
  • Prioritize datanode RPCs over client RPCs
• Overall HDFS QoS
  • Namenode fine-grained locking
  • Fairness for data transfers
  • HTTP-based payloads such as WebHDFS
51. Conclusion
• Try it out!
• No namenode congestion since it was enabled at both Twitter and eBay
• Providing QoS at the RPC level is an important step toward fine-grained HDFS QoS
52. Special thanks to our reviewers:
• Arpit Agarwal (Hortonworks)
• Daryn Sharp (Yahoo)
• Andrew Wang (Cloudera)
• Benoy Antony (eBay)
• Jing Zhao (Hortonworks)
• Hiroshi Ideka (vic.co.jp)
• Eddy Xu (Cloudera)
• Steve Loughran (Hortonworks)
• Suresh Srinivas (Hortonworks)
• Kihwal Lee (Yahoo)
• Joep Rottinghuis (Twitter)
• Lohit VijayaRenu (Twitter)
53. Questions and Answers
• For help setting up QoS, feature ideas, or questions, contact:
Ming Ma: @mingmasplace
Chris Li: chrili_sf@ebaysf.com
55. FairCallQueue Data
• 37-node cluster
• 10 users each run a job which has:
  • 20 mappers; each mapper:
  • runs 100 threads; each thread:
  • continuously calls hdfs.exists() in a tight loop
• Spikes are caused by garbage collection, a separate issue
57. Related JIRAs
• FairCallQueue + Backoff: HADOOP-9640
• Cross Data Center Traffic QoS: HDFS-5175
• nntop: HDFS-6982
• Datanode Congestion Control: HDFS-7270
• Namenode fine-grained locking: HDFS-5453
58. Thoughts on Tuning
• Worth considering if you run a larger cluster or have many users
• Make your life easier while tuning by refreshing the queue with hadoop dfsadmin -refreshCallQueue
59. Anatomy of a QoS conf key
• core-site.xml
• ipc.8020.faircallqueue.priority-levels
• The "8020" is the RPC server's port; customize it if you use a non-default port or a service RPC port (see the example below)
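For instance, on a hypothetical cluster whose service RPC port is 8021 (substitute your own port), only the port segment of the key changes:

<property>
  <!-- 8021 is a hypothetical service RPC port; the key name tracks the port -->
  <name>ipc.8021.faircallqueue.priority-levels</name>
  <value>4</value>
</property>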
60. Number of Sub-queues
• More sub-queues = more unique classes of service
• Recommend 10 for larger clusters
key: ipc.8020.faircallqueue.priority-levels
default: 4
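As config, the recommendation above for a larger cluster would look like this (assuming the default 8020 port; with 10 levels you will also want matching thresholds and multiplexer weights, covered on the next slides):

<property>
  <name>ipc.8020.faircallqueue.priority-levels</name>
  <value>10</value>
</property>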
61. Scheduler: Decay Factor
• Controls how much accumulated call counts are decayed on each sweep. Larger values decay more slowly.
• Ex: 1024 calls with a decay factor of 0.5 will take 10 sweeps to decay, assuming the user makes no additional calls.
key: ipc.8020.faircallqueue.decay-scheduler.decay-factor
default: 0.5
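Spelling out the arithmetic behind the example: multiplying by d = 0.5 on each sweep, a count of 1024 shrinks as 1024 → 512 → 256 → … → 2 → 1, i.e. log2(1024) = 10 sweeps to decay away. In general, a count c with decay factor d takes roughly

  n = ceil(log(c) / log(1/d)) sweeps

to decay to nothing.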
62. Scheduler: Sweep Period
• How many ms between each decay sweep. Smaller is more responsive, but sweeps have overhead.
• Ex: if it takes 10 sweeps to decay and we sweep every 5 seconds, a user's activity will remain for 50 s.
key: ipc.8020.faircallqueue.decay-scheduler.period-ms
default: 5000
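A sketch of a more responsive setting (the 1000 ms value is illustrative, not a recommendation from the talk): it trades more frequent sweep overhead for priorities that track shifting traffic in roughly 10 seconds instead of 50.

<property>
  <name>ipc.8020.faircallqueue.decay-scheduler.period-ms</name>
  <value>1000</value>
</property>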
63. Scheduler: Thresholds
• List of floats; determines the boundaries between service classes. If you have 4 queues, you'll have 3 bounds.
• Each number represents a percentage of total calls.
• The first number is the threshold for going into queue 0 (highest priority), the second decides queue 1 vs. the rest, etc.
• Recommend trying even splits (10, 20, 30, … 90) or exponential (the default); see the sketch below
key: ipc.8020.faircallqueue.decay-scheduler.thresholds
default: 12%, 25%, 50%
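With the default 4 queues, even splits mean 3 bounds at 25, 50, and 75 percent of total calls. A minimal sketch (double-check the exact value syntax your Hadoop version's decay scheduler accepts):

<property>
  <name>ipc.8020.faircallqueue.decay-scheduler.thresholds</name>
  <value>25,50,75</value>
</property>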
64. Multiplexer: Weights
• Weights are how many times the mux will try to read from a sub-queue before moving on to the next sub-queue.
• Ex: 4,3,1 used for 3 queues means: read up to 4 times from queue 0, up to 3 times from queue 1, once from queue 2, then repeat.
• The mux controls the penalty of being in a low-priority queue. Recommend not setting any weight to 0, as that can starve a sub-queue.
key: ipc.8020.faircallqueue.multiplexer.weights
default: 8,4,2,1
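The slide's 4,3,1 example written out as config (this assumes you have also set priority-levels to 3 so the number of weights matches the number of sub-queues):

<property>
  <name>ipc.8020.faircallqueue.priority-levels</name>
  <value>3</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>4,3,1</value>
</property>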
65. Backoff Max Attempts
• The default is equivalent to 90 seconds of retrying
• To achieve the equivalent of 10 minutes of retrying, set it to 44
key: dfs.client.retry.max.attempts
default: 10
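In config form (this is a client-side setting, so it goes in the hdfs-site.xml that clients read):

<property>
  <name>dfs.client.retry.max.attempts</name>
  <value>44</value>
</property>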