Collection of Small Tips on Further Stabilizing your Hadoop Cluster
1. Collection of small tips on further stabilizing your Hadoop cluster
Presented by Koji Noguchi | June 3, 2014
2014 Hadoop Summit, San Jose, California
2. Agenda
Yahoo Confidential & Proprietary
Who I am
What’s NOT covered
List of tips that I found useful
Q&A
3. Who I am
Grid Support/Solutions at Yahoo
› Helping users on the internal Hadoop clusters
• Covering everything!?
• Covering any tiny pieces not picked up by others…
[Diagram: overlap of USER / OPS / Dev]
6. What’s NOT covered in this talk
How to maintain the clusters (ops)
› automation on breakfixing, upgrading, monitoring
How to configure hadoop clusters (dev)
› Healthcheck script, reserving disk space, number of slots per node, etc.
How to tune your hadoop jobs (user)
› Less spilling and identifying bottlenecks
7. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
9. Slow nodes hurting Hadoop cluster?
One of Hadoop's strengths: FAULT TOLERANCE
› Complete failure → GREAT! Fast recovery
› Partial failure → NOT so great…
Tasks scheduled on slow nodes can take forever…
11. Speculative Execution helps?
Launch a redundant copy of the slow task and take the output from whichever attempt finishes first.
[Diagram: task1 attempt0 on NodeA reading from HDFS (DATA) and writing Output]
14. Speculative Execution helps?
Speculative Execution helps a lot BUT …
1. Both nodes could be slow
2. Two attempts can still hit a slow datanode
3. Not all jobs can use speculative execution
15. How it used to work
With 100s of users, there was at least one user who was good at policing.
22. Identifying slow nodes
Speculative Execution
› BEFORE: at runtime, users used it to work around hitting the slow nodes
› HERE: a day (or hours) later, use the logs to identify the slow nodes
27. Pig Job analyzing the job history
A = LOAD 'starling.starling_task_attempts' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by dt >= '$STARTDATE';
describe B;
C = FOREACH B generate grid,dt,task_id,task_attempt_id,type,host_name,status,start_ts,shuffle_time,sort_time,finish_time;
D = FILTER C by type == 'MAP' or type == 'REDUCE';
ATTEMPT0 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_0') == (SIZE(task_attempt_id) - 2);
ATTEMPT1 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_1') == (SIZE(task_attempt_id) - 2);
-- This would filter out any task that had only 1 attempt
TaskWithAtLeastTwoAttempts = join ATTEMPT0 by task_id, ATTEMPT1 by task_id;
-- For simplicity, I am only looking at task that had second task attempt successful
TaskWith2ndAttemptSuccess = filter TaskWithAtLeastTwoAttempts by (ATTEMPT1::status == 'SUCCESS' or ATTEMPT1::status == 'SUCCEEDED')
and ATTEMPT0::status == 'KILLED'
and ATTEMPT0::start_ts + ATTEMPT0::shuffle_time + ATTEMPT0::sort_time + ATTEMPT0::finish_time > ATTEMPT1::start_ts;
-- Counting number of first attempt fail and kill event for each node
FirstFailedKilledAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT0::grid, ATTEMPT0::host_name, ATTEMPT0::dt;
FirstFailedKilledAttempt2 = GROUP FirstFailedKilledAttempt by (grid, host_name, dt);
FirstFailedKilledAttempt3 = FOREACH FirstFailedKilledAttempt2 generate group.grid, group.host_name, group.dt, COUNT(FirstFailedKilledAttempt) as firstFailedCounts;
-- Only counting number of failure gave too much false positive. Counting how many times each node won.
SecondSuccessAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT1::grid, ATTEMPT1::host_name, ATTEMPT1::dt;
SecondSuccessAttempt2 = GROUP SecondSuccessAttempt by (grid,host_name, dt);
SecondSuccessAttempt3 = FOREACH SecondSuccessAttempt2 generate group.grid, group.host_name, group.dt, COUNT(SecondSuccessAttempt) as secondSuccessfulCounts;
GridNodeSuccessFailedCounts = join FirstFailedKilledAttempt3 by (grid,host_name,dt) left outer, SecondSuccessAttempt3 by (grid,host_name,dt);
GridNodeSuccessFailedCounts2 = FILTER GridNodeSuccessFailedCounts by firstFailedCounts > 50
and firstFailedCounts > (secondSuccessfulCounts is null ? 0 : secondSuccessfulCounts ) * 4;
28. Pig Job analyzing the job history
For any task with
  attempt0 "KILLED" and attempt1 "SUCCESS"
  && attempt0's finish time > attempt1's start time:
    WIN[attempt1's node] += 1
    LOSE[attempt0's node] += 1
Aggregate and print out any node with
  LOSE[node] > WIN[node] * 4
  && LOSE[node] > 50
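The win/lose tally above can be sketched in Python. The record shape (field names 'node', 'status', 'start', 'finish') is an assumption for illustration, not the actual Starling schema:

```python
from collections import Counter

def tally_slow_nodes(attempts, min_losses=50, win_factor=4):
    """Tally which nodes lose speculative-execution races.

    `attempts` maps task_id -> {attempt_index: record}; each record
    is a dict with 'node', 'status', 'start', 'finish' (hypothetical
    field names for illustration).
    """
    win, lose = Counter(), Counter()
    for by_attempt in attempts.values():
        a0, a1 = by_attempt.get(0), by_attempt.get(1)
        if a0 is None or a1 is None:
            continue  # task had only one attempt
        # attempt0 killed, attempt1 succeeded, and attempt0 was still
        # running when attempt1 started: attempt1's node won the race
        if (a0['status'] == 'KILLED' and a1['status'] == 'SUCCESS'
                and a0['finish'] > a1['start']):
            win[a1['node']] += 1
            lose[a0['node']] += 1
    # flag nodes that lose often and far more often than they win
    return [node for node, losses in lose.items()
            if losses > min_losses and losses > win[node] * win_factor]
```

The defaults mirror the slide's thresholds: LOSE > 50 and LOSE > WIN * 4.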
29. Result
Extremely slow nodes showed up
› Losing over 50 times and winning 0 or 1 times
› Report to ops if this happens 2 days in a row
On clusters with mixed configs & hardware
› Showed a trend of one type of node winning over the others
31. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
32. Misconfigured Nodes
Tasks repeatedly fail for some users & jobs
Like termites: when users notice, it's too late
33. Finding misconfigured nodes
Modify the previous slow-node detection script:
For any task with
  attempt0 "FAILED" and attempt1 "SUCCESS"
  && attempt0's finish time > attempt1's start time
Aggregate per node
› fail count > 30 per day
› Add the first 4 attempt IDs and error messages
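The per-node aggregation described above might look like this in Python. Again, the record fields ('node', 'dt', 'attempt_id', 'status', 'error') are illustrative assumptions:

```python
from collections import defaultdict

def find_misconfigured_nodes(records, min_fails=30, sample=4):
    """Group first-attempt failures per (node, day) and flag nodes
    that fail more than `min_fails` times, keeping the first few
    attempt IDs and error messages as evidence for ops.

    Each record is a dict with 'node', 'dt', 'attempt_id', 'status',
    'error' (hypothetical field names for illustration).
    """
    per_node = defaultdict(list)
    for rec in records:
        # first-attempt IDs end in '_0'; only FAILED ones count here
        if rec['status'] == 'FAILED' and rec['attempt_id'].endswith('_0'):
            per_node[(rec['node'], rec['dt'])].append(rec)
    report = {}
    for key, recs in per_node.items():
        if len(recs) > min_fails:
            report[key] = [(r['attempt_id'], r['error']) for r in recs[:sample]]
    return report
```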
35. Results
Actively finding issues (but still manual)
1. Misconfigured nodes with job/error references
2. Detect regression in OS rolling upgrades
› Users code failing
› Some OS specific errors (disk/user-lookup/etc)
3. Detect partial network failures/slowness
› Pair of nodes failing with Map fetch failures
› Nodes failing on localizations
4. Detect Hadoop bugs
› Like the Disk-Fail-In-Place bug with the distributed cache
36. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
37. When cluster is bottlenecked on CPU…
In 0.23 + CapacityScheduler
› Scheduling is based on memory
› Memory limit enforced by the NodeManager, but not CPU
Less important after Hadoop 2.x + CPU-aware scheduling
39. JobHistory Log
For each task attempt:
{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":826430}
{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed (ms)","value":38863}
40. Find possible jobs wasting CPU
For each task attempt
› CPU_TIME / attempt_time (0 ~ 20) [CPU_RATIO]
› GC_TIME / attempt_time (0 ~ 1.0) [GC_RATIO]
Aggregate per job and show
› MAX_CPU_RATIO, MAX_GC_RATIO, AVG_CPU_RATIO, AVG_GC_RATIO
Also, collecting percentage per day per job
› resources(Mbytes)%
› CPU_TIME%
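A minimal Python sketch of the per-job aggregation, assuming each attempt has already been reduced to a (cpu_ms, gc_ms, wall_ms) triple from its counters:

```python
def job_cpu_gc_summary(attempts):
    """Per-job MAX/AVG of CPU and GC ratios across task attempts.

    `attempts` is a list of (cpu_ms, gc_ms, wall_ms) triples taken
    from the CPU_MILLISECONDS / GC_TIME_MILLIS counters and the
    attempt's wall-clock time (input shape assumed for illustration).
    """
    cpu_ratios = [cpu / wall for cpu, _, wall in attempts]
    gc_ratios = [gc / wall for _, gc, wall in attempts]
    n = len(attempts)
    # the CPU ratio can exceed 1.0 for multi-threaded tasks (roughly
    # 0~20 per the slide); the GC ratio stays within 0~1.0
    return {'MAX_CPU_RATIO': max(cpu_ratios),
            'AVG_CPU_RATIO': sum(cpu_ratios) / n,
            'MAX_GC_RATIO': max(gc_ratios),
            'AVG_GC_RATIO': sum(gc_ratios) / n}
```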
41. Results
Able to reach out to users wasting CPU
› Job having a task taking 10-20 times the CPU time
› Job using __% of resources but __% of CPU time
For one job with 85% GC time, a 3x speedup after switching from ParallelGC to SerialGC (-XX:+UseSerialGC), down to 50% GC time
Another +25% with G1GC, but with more CPU time
42. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
43. Limited HDFS Space
HDFS quotas have significantly reduced the amount of abuse cases
But we still see the "HDFS almost full! Please delete" broadcast email once in a while
44. Space to look for
1. Large directory that hasn’t changed
2. Large directory that suddenly increased
3. Large directory that hasn’t been accessed
4. Large directory not compressed
46. Space to look for
1. Large directory that hasn’t changed
2. Large directory that suddenly increased
Save the following result daily:
hdfs dfs -count /user/* /projects/*/*
Take a diff from __ days back.
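The daily diff could be scripted roughly like this. The snapshot format (each day's `hdfs dfs -count` output parsed into a {directory: content_bytes} dict) and the 10 TB growth threshold are assumptions:

```python
def diff_counts(today, days_back, min_growth_bytes=10 * 2**40):
    """Compare two saved `hdfs dfs -count` snapshots, each parsed
    into a {directory: content_bytes} dict, and report directories
    that grew by more than `min_growth_bytes` (default 10 TB)."""
    grown = {}
    for path, size in today.items():
        # directories absent from the older snapshot count as new
        delta = size - days_back.get(path, 0)
        if delta > min_growth_bytes:
            grown[path] = delta
    return grown
```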
47. 4. Large directory not compressed
Too big to search and read the entire HDFS; need to cut down the search space
Interested in data created daily/hourly/etc.
48. 4. Large directory not compressed
listdir = (/)
while (listdir not empty)
  dir = listdir.pop()
  if (dir.size() < 5 TBytes) { skip/continue }
  if (dir has #subdirs with timestamp > 7) {
    pick one large file from a recent timestamp subdir
    hcat $file | head --bytes 10MB | gzip -c | wc --bytes
  } else {
    push all subdirs to listdir
  }
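The sampling step in the loop, compressing a 10 MB head of one file and comparing sizes, can be sketched in Python with gzip as a stand-in for the shell pipeline:

```python
import gzip

def estimate_compression_ratio(sample_bytes):
    """Estimate how compressible a file is from a sample of its
    bytes: returns raw_size / gzip_size. In the deck's pipeline the
    sample would come from `hcat $file | head --bytes 10MB`."""
    compressed = gzip.compress(sample_bytes)
    return len(sample_bytes) / len(compressed)
```

A ratio well above 1 suggests the directory is stored uncompressed; a ratio near 1 means it is already compressed.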
49. 4. Large directory not compressed
DIRNAME: /projects/DDD/d1/d2/d3/d4
DIRSIZE: 77,912,005,675,237 (~70TB)
CLUSTER: mycluster-tan
Username: ddd_aa
Compression Ratio: 12.6718
Sample File: /projects/DDD/d1/d2/d3/d4/2014051405/part-m-00000
Sample Filesize: 134,217,852
Takes a couple of hours as a sequential script per cluster
50. Results
By periodically collecting HDFS usage and compression state:
› Identify stale dirs
› Identify suddenly increasing dirs
› Identify uncompressed dirs
51. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
52. Why did my tiny job take hours yesterday?
A bug in the user's code?
Queue full?
Cluster full?
If it's a queue/cluster resource issue, what changed recently?
53. Needed a way to look back
Periodically save the output of
% mapred job -list
…
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1400781790269_206630 RUNNING 1400867814129 user1 queue1 NORMAL 2 0 3072M 0M 3072M mycluster.___.com:8088/proxy/application_1400781790269_206630/
…
54. Needed a way to look back (2)
55. Results
Users can look back and see if their jobs hung due to queue/cluster contention
Saving 'mapred job -list' outputs let me go back and check individual jobs