Collection of Small Tips on Further Stabilizing your Hadoop Cluster
1. Collection of small tips on further stabilizing your Hadoop cluster
Presented by Koji Noguchi | June 3, 2014
2014 Hadoop Summit, San Jose, California
2. Agenda
Yahoo Confidential & Proprietary
Who I am
What’s NOT covered
List of tips that I found useful
Q&A
3. Who I am
Grid Support/Solutions at Yahoo
› Helping users on the internal Hadoop clusters
• Covering everything!?
• Covering any tiny pieces not picked up by others…
[Diagram: overlap of USER / OPS / Dev]
6. What’s NOT covered in this talk
How to maintain the clusters (ops)
› automation on breakfixing, upgrading, monitoring
How to configure hadoop clusters (dev)
› Healthcheck script, reserving disk space, number of slots per node, etc.
How to tune your hadoop jobs (user)
› Less spilling and identifying bottlenecks
7. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
9. Slow nodes hurting Hadoop cluster?
One of Hadoop's strengths: FAULT TOLERANCE
› Complete failure → GREAT! Fast recovery
› Partial failure → NOT so great…
Tasks scheduled on slow nodes can take forever…
11. Speculative Execution helps?
Launch a redundant copy of the slow task and take the output from whichever attempt finishes first.
[Diagram: task1 attempt0 on NodeA reading from HDFS (DATA) and writing Output]
14. Speculative Execution helps?
Speculative Execution helps a lot BUT …
1. Both nodes could be slow
2. Two attempts can still hit a slow datanode
3. Not all jobs can use speculative execution
15. How it used to work
With 100s of users, there was at least one user who was good at policing.
22. Identifying slow nodes
Speculative Execution
› BEFORE: at runtime, users used it to work around hitting the slow nodes
› HERE: a day (or hours) later, use the logs to identify the slow nodes
27. Pig Job analyzing the job history
A = LOAD 'starling.starling_task_attempts' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by dt >= '$STARTDATE';
describe B;
C = FOREACH B generate grid,dt,task_id,task_attempt_id,type,host_name,status,start_ts,shuffle_time,sort_time,finish_time;
D = FILTER C by type == 'MAP' or type == 'REDUCE';
ATTEMPT0 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_0') == (SIZE(task_attempt_id) - 2);
ATTEMPT1 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_1') == (SIZE(task_attempt_id) - 2);
-- This would filter out any task that had only 1 attempt
TaskWithAtLeastTwoAttempts = join ATTEMPT0 by task_id, ATTEMPT1 by task_id;
-- For simplicity, I am only looking at task that had second task attempt successful
TaskWith2ndAttemptSuccess = filter TaskWithAtLeastTwoAttempts by (ATTEMPT1::status == 'SUCCESS' or ATTEMPT1::status == 'SUCCEEDED')
and ATTEMPT0::status == 'KILLED'
and ATTEMPT0::start_ts + ATTEMPT0::shuffle_time + ATTEMPT0::sort_time + ATTEMPT0::finish_time > ATTEMPT1::start_ts;
-- Counting number of first attempt fail and kill event for each node
FirstFailedKilledAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT0::grid, ATTEMPT0::host_name, ATTEMPT0::dt;
FirstFailedKilledAttempt2 = GROUP FirstFailedKilledAttempt by (grid, host_name, dt);
FirstFailedKilledAttempt3 = FOREACH FirstFailedKilledAttempt2 generate group.grid, group.host_name, group.dt, COUNT(FirstFailedKilledAttempt) as firstFailedCounts;
-- Only counting number of failure gave too much false positive. Counting how many times each node won.
SecondSuccessAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT1::grid, ATTEMPT1::host_name, ATTEMPT1::dt;
SecondSuccessAttempt2 = GROUP SecondSuccessAttempt by (grid,host_name, dt);
SecondSuccessAttempt3 = FOREACH SecondSuccessAttempt2 generate group.grid, group.host_name, group.dt, COUNT(SecondSuccessAttempt) as secondSuccessfulCounts;
GridNodeSuccessFailedCounts = join FirstFailedKilledAttempt3 by (grid,host_name,dt) left outer, SecondSuccessAttempt3 by (grid,host_name,dt);
GridNodeSuccessFailedCounts2 = FILTER GridNodeSuccessFailedCounts by firstFailedCounts > 50
and firstFailedCounts > (secondSuccessfulCounts is null ? 0 : secondSuccessfulCounts ) * 4;
28. Pig Job analyzing the job history
For any task with
  attempt0 "KILLED" and attempt1 "SUCCESS"
  && attempt0's finish time > attempt1's start time:
    WIN[attempt1's node] += 1
    LOSE[attempt0's node] += 1
Aggregate and print out any node with
  LOSE[node] > WIN[node] * 4
  && LOSE[node] > 50
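The win/lose tally above can be sketched in Python. The record shape (field names 'node', 'status', 'start', 'finish') is an assumption for illustration, not the actual Starling schema:

```python
from collections import Counter

def tally_slow_nodes(attempts, min_losses=50, win_factor=4):
    """Tally which nodes lose speculative-execution races.

    `attempts` maps task_id -> {attempt_index: record}; each record
    is a dict with 'node', 'status', 'start', 'finish' (hypothetical
    field names for illustration).
    """
    win, lose = Counter(), Counter()
    for by_attempt in attempts.values():
        a0, a1 = by_attempt.get(0), by_attempt.get(1)
        if a0 is None or a1 is None:
            continue  # task had only one attempt
        # attempt0 killed, attempt1 succeeded, and attempt0 was still
        # running when attempt1 started: attempt1's node won the race
        if (a0['status'] == 'KILLED' and a1['status'] == 'SUCCESS'
                and a0['finish'] > a1['start']):
            win[a1['node']] += 1
            lose[a0['node']] += 1
    # flag nodes that lose often and far more often than they win
    return [node for node, losses in lose.items()
            if losses > min_losses and losses > win[node] * win_factor]
```

The defaults mirror the slide's thresholds: LOSE > 50 and LOSE > WIN * 4.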
29. Result
Extremely slow nodes showed up
› Losing over 50 times and winning 0 or 1 times
› Report to ops if this happens 2 days in a row
On clusters with mixed configs & hardware
› Showed a trend of one type of node winning over the others
31. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
32. Misconfigured Nodes
Tasks repeatedly fail for some users & jobs
Like termites: when users notice, it's too late
33. Finding misconfigured nodes
Modify the previous slow-node detection script:
For any task with
  attempt0 "FAILED" and attempt1 "SUCCESS"
  && attempt0's finish time > attempt1's start time
Aggregate per node
› fail count > 30 per day
› Add the first 4 attempt IDs and error messages
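The per-node aggregation described above might look like this in Python. Again, the record fields ('node', 'dt', 'attempt_id', 'status', 'error') are illustrative assumptions:

```python
from collections import defaultdict

def find_misconfigured_nodes(records, min_fails=30, sample=4):
    """Group first-attempt failures per (node, day) and flag nodes
    that fail more than `min_fails` times, keeping the first few
    attempt IDs and error messages as evidence for ops.

    Each record is a dict with 'node', 'dt', 'attempt_id', 'status',
    'error' (hypothetical field names for illustration).
    """
    per_node = defaultdict(list)
    for rec in records:
        # first-attempt IDs end in '_0'; only FAILED ones count here
        if rec['status'] == 'FAILED' and rec['attempt_id'].endswith('_0'):
            per_node[(rec['node'], rec['dt'])].append(rec)
    report = {}
    for key, recs in per_node.items():
        if len(recs) > min_fails:
            report[key] = [(r['attempt_id'], r['error']) for r in recs[:sample]]
    return report
```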
35. Results
Actively finding issues (but still manual)
1. Misconfigured nodes with job/error references
2. Detect regression in OS rolling upgrades
› Users code failing
› Some OS specific errors (disk/user-lookup/etc)
3. Detect partial network failures/slowness
› Pair of nodes failing with Map fetch failures
› Nodes failing on localizations
4. Detect Hadoop bugs
› Like the Disk-Fail-In-Place bug with the distributed cache
36. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
37. When cluster is bottlenecked on CPU…
In 0.23 + CapacityScheduler
› Scheduling is based on memory
› Memory limit enforced by the NodeManager, but not CPU
Less important after Hadoop 2.x + CPU-aware scheduling
39. JobHistory Log
For each task attempt:
{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":826430}
{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed (ms)","value":38863}
40. Find possible jobs wasting CPU
For each task attempt
› CPU_TIME / attempt_time (0 ~ 20) [CPU_RATIO]
› GC_TIME / attempt_time (0 ~ 1.0) [GC_RATIO]
Aggregate per job and show
› MAX_CPU_RATIO, MAX_GC_RATIO, AVG_CPU_RATIO, AVG_GC_RATIO
Also, collecting percentage per day per job
› resources(Mbytes)%
› CPU_TIME%
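A minimal Python sketch of the per-job aggregation, assuming each attempt has already been reduced to a (cpu_ms, gc_ms, wall_ms) triple from its counters:

```python
def job_cpu_gc_summary(attempts):
    """Per-job MAX/AVG of CPU and GC ratios across task attempts.

    `attempts` is a list of (cpu_ms, gc_ms, wall_ms) triples taken
    from the CPU_MILLISECONDS / GC_TIME_MILLIS counters and the
    attempt's wall-clock time (input shape assumed for illustration).
    """
    cpu_ratios = [cpu / wall for cpu, _, wall in attempts]
    gc_ratios = [gc / wall for _, gc, wall in attempts]
    n = len(attempts)
    # the CPU ratio can exceed 1.0 for multi-threaded tasks (roughly
    # 0~20 per the slide); the GC ratio stays within 0~1.0
    return {'MAX_CPU_RATIO': max(cpu_ratios),
            'AVG_CPU_RATIO': sum(cpu_ratios) / n,
            'MAX_GC_RATIO': max(gc_ratios),
            'AVG_GC_RATIO': sum(gc_ratios) / n}
```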
41. Results
Able to reach out to users wasting CPU
› Job having a task taking 10-20 times the CPU time
› Job using __% of resources but __% of CPU time
For one job with 85% GC time, a 3x speedup after switching from ParallelGC to SerialGC (-XX:+UseSerialGC), down to 50% GC time
Another +25% with G1GC, but with more CPU time
42. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
43. Limited HDFS Space
HDFS quotas have significantly reduced the amount of abuse cases
But we still see the "HDFS almost full! Please delete" broadcast email once in a while
44. Space to look for
1. Large directory that hasn’t changed
2. Large directory that suddenly increased
3. Large directory that hasn’t been accessed
4. Large directory not compressed
46. Space to look for
1. Large directory that hasn’t changed
2. Large directory that suddenly increased
Save the following result daily:
hdfs dfs -count /user/* /projects/*/*
Take a diff from __ days back.
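The daily diff could be scripted roughly like this. The snapshot format (each day's `hdfs dfs -count` output parsed into a {directory: content_bytes} dict) and the 10 TB growth threshold are assumptions:

```python
def diff_counts(today, days_back, min_growth_bytes=10 * 2**40):
    """Compare two saved `hdfs dfs -count` snapshots, each parsed
    into a {directory: content_bytes} dict, and report directories
    that grew by more than `min_growth_bytes` (default 10 TB)."""
    grown = {}
    for path, size in today.items():
        # directories absent from the older snapshot count as new
        delta = size - days_back.get(path, 0)
        if delta > min_growth_bytes:
            grown[path] = delta
    return grown
```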
47. 4. Large directory not compressed
Too big to search and read the entire HDFS; need to cut down the search space
Interested in data created daily/hourly/etc.
48. 4. Large directory not compressed
listdir = (/)
while (listdir not empty)
  dir = listdir.pop()
  if (dir.size() < 5 TBytes) { skip/continue }
  if (dir has #subdirs with timestamp > 7) {
    pick one large file from a recent timestamp subdir
    hcat $file | head --bytes 10MB | gzip -c | wc --bytes
  } else {
    push all subdirs to listdir
  }
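The sampling step in the loop, compressing a 10 MB head of one file and comparing sizes, can be sketched in Python with gzip as a stand-in for the shell pipeline:

```python
import gzip

def estimate_compression_ratio(sample_bytes):
    """Estimate how compressible a file is from a sample of its
    bytes: returns raw_size / gzip_size. In the deck's pipeline the
    sample would come from `hcat $file | head --bytes 10MB`."""
    compressed = gzip.compress(sample_bytes)
    return len(sample_bytes) / len(compressed)
```

A ratio well above 1 suggests the directory is stored uncompressed; a ratio near 1 means it is already compressed.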
49. 4. Large directory not compressed
DIRNAME: /projects/DDD/d1/d2/d3/d4
DIRSIZE: 77,912,005,675,237 (~70TB)
CLUSTER: mycluster-tan
Username: ddd_aa
Compression Ratio: 12.6718
Sample File: /projects/DDD/d1/d2/d3/d4/2014051405/part-m-00000
Sample Filesize: 134,217,852
Takes a couple of hours as a sequential script per cluster
50. Results
By periodically collecting HDFS usage and compression state:
› Identify stale dirs
› Identify suddenly increasing dirs
› Identify uncompressed dirs
51. Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
52. Why did my tiny job take hours yesterday?
A bug in the user's code?
Queue full?
Cluster full?
If it's a queue/cluster resource issue, what changed recently?
53. Needed a way to look back
Periodically save the output of
% mapred job -list
…
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1400781790269_206630 RUNNING 1400867814129 user1 queue1 NORMAL 2 0 3072M 0M 3072M mycluster.___.com:8088/proxy/application_1400781790269_206630/
…
54. Needed a way to look back (2)
55. Results
Users can look back and see if their jobs hung due to queue/cluster contention
Saving 'mapred job -list' outputs let me go back and check individual jobs