Collection of small tips on further
stabilizing your hadoop cluster
Presented by Koji Noguchi | June 3, 2014
2014 Hadoop Summit, San Jose, California
Agenda
2 Yahoo Confidential & Proprietary
Who I am
What’s NOT covered
List of tips that I found useful
Q&A
Who I am
Grid Support/Solutions at Yahoo.
› Helping users on the internal Hadoop clusters
• Covering everything!?
→ Covering any tiny pieces not picked up by others…
[Diagram: USER, OPS, Dev]
What’s NOT covered in this talk
How to maintain the clusters (ops)
› automation on breakfixing, upgrading, monitoring
How to configure hadoop clusters (dev)
› Healthcheck script. Reserving disk space.
Number of slots per node, etc
How to tune your hadoop jobs (user)
› Less spilling and identifying bottlenecks
Some items that turned out to be useful
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
Slow nodes hurting Hadoop cluster?
› One of Hadoop's strengths: FAULT TOLERANCE
› Complete failure → GREAT! Fast recovery
› Partial failure → NOT so great…
Tasks scheduled on slow nodes can take forever…
Speculative Execution helps?
› Run a redundant copy of the slow task and take the output from whichever attempt finishes first
[Diagram: task1 attempt0 runs on NodeA against HDFS (DATA); a second attempt, task1 attempt1, runs on NodeB; the output is taken from the faster attempt. NodeX, NodeY and NodeZ are also shown.]
Speculative Execution helps?
Speculative Execution helps a lot BUT …
1. Both nodes could be slow
2. Two attempts can still hit a slow datanode
3. Not all jobs can use speculative execution
How it used to work
With 100s of users
› At least 1 user who's good at policing
AUTOMATION
https://www.flickr.com/photos/antiuniverse/410462775
Identifying slow nodes
Comparing performance
Speculative Execution
BEFORE: At runtime, users use it to work around slow nodes
HERE: A day (or hours) later, use the logs to identify the slow nodes.
[Timeline diagram (TIME axis: 0:00, 10:00, 20:00): task1 attempt0 on NodeA and task1 attempt1 on NodeB, with Output; NodeX and NodeY are also shown.]
Speculative Execution Again
[Same timeline diagram, with task1 attempt0 on NodeA marked KILLED]
LOSE['NodeA'] += 1
WIN['NodeB'] += 1
JobHistory Log
{"type":"MAP_ATTEMPT_STARTED","event":...
"attemptId":"attempt_1399615563645_308371_m_000000_0","startTime":1400522576570,"trackerName":"gsbl31747.blue.ygrid.yahoo.com",…
{"type":"MAP_ATTEMPT_FINISHED","event":...,"taskType":"MAP","taskStatus":"SUCCEEDED","mapFinishTime":1400522582385
Pig Job analyzing the job history
-- Load task attempts from the job history table and keep recent dates
A = LOAD 'starling.starling_task_attempts' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by dt >= '$STARTDATE';
describe B;
C = FOREACH B generate grid,dt,task_id,task_attempt_id,type,host_name,status,start_ts,shuffle_time,sort_time,finish_time;
D = FILTER C by type == 'MAP' or type == 'REDUCE';
-- Attempt IDs end in _0 for the first attempt and _1 for the second
ATTEMPT0 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_0') == (SIZE(task_attempt_id) - 2);
ATTEMPT1 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_1') == (SIZE(task_attempt_id) - 2);
-- This would filter out any task that had only 1 attempt
TaskWithAtLeastTwoAttempts = join ATTEMPT0 by task_id, ATTEMPT1 by task_id;
-- For simplicity, I am only looking at tasks whose second attempt succeeded after the first was killed,
-- and where attempt0 was still running when attempt1 started
TaskWith2ndAttemptSuccess = filter TaskWithAtLeastTwoAttempts by (ATTEMPT1::status == 'SUCCESS' or ATTEMPT1::status == 'SUCCEEDED')
    and ATTEMPT0::status == 'KILLED'
    and ATTEMPT0::start_ts + ATTEMPT0::shuffle_time + ATTEMPT0::sort_time + ATTEMPT0::finish_time > ATTEMPT1::start_ts;
-- Counting the number of killed first attempts for each node
FirstFailedKilledAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT0::grid, ATTEMPT0::host_name, ATTEMPT0::dt;
FirstFailedKilledAttempt2 = GROUP FirstFailedKilledAttempt by (grid,host_name, dt);
FirstFailedKilledAttempt3 = FOREACH FirstFailedKilledAttempt2 generate group.grid, group.host_name, group.dt, COUNT(FirstFailedKilledAttempt) as firstFailedCounts;
-- Counting only the failures gave too many false positives, so also count how many times each node won.
SecondSuccessAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT1::grid, ATTEMPT1::host_name, ATTEMPT1::dt;
SecondSuccessAttempt2 = GROUP SecondSuccessAttempt by (grid,host_name, dt);
SecondSuccessAttempt3 = FOREACH SecondSuccessAttempt2 generate group.grid, group.host_name, group.dt, COUNT(SecondSuccessAttempt) as secondSuccessfulCounts;
GridNodeSuccessFailedCounts = join FirstFailedKilledAttempt3 by (grid,host_name,dt) left outer, SecondSuccessAttempt3 by (grid,host_name,dt);
-- Keep nodes that lost more than 50 times and more than 4 times as often as they won
GridNodeSuccessFailedCounts2 = FILTER GridNodeSuccessFailedCounts by firstFailedCounts > 50
    and firstFailedCounts > (secondSuccessfulCounts is null ? 0 : secondSuccessfulCounts ) * 4;
-- Print the suspect nodes
DUMP GridNodeSuccessFailedCounts2;
Pig Job analyzing the job history
For any task with
attempt0 "KILLED" and attempt1 "SUCCESS"
&& attempt0's finish time > attempt1's start time
WIN[attempt1's node] += 1
LOSE[attempt0's node] += 1
Aggregate and print out any nodes with
LOSE['node'] > WIN['node'] * 4
&& LOSE['node'] > 50
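The WIN/LOSE rule above can be sketched in a few lines of Python. This is only an illustration, not the script used in the talk: the input layout (one dict per task holding `attempt0` and `attempt1` with `node`, `status`, `start_time` and `finish_time`) and the function name `find_slow_nodes` are assumptions made for the example.

```python
from collections import Counter

def find_slow_nodes(task_pairs, min_losses=50, ratio=4):
    """Return nodes that lose speculative races far more often than they win.

    task_pairs is a hypothetical layout: one dict per task with the first
    two attempts under the keys "attempt0" and "attempt1".
    """
    win, lose = Counter(), Counter()
    for task in task_pairs:
        a0, a1 = task["attempt0"], task["attempt1"]
        # attempt0 was still running when attempt1 started
        overlapped = a0["finish_time"] > a1["start_time"]
        if (a0["status"] == "KILLED"
                and a1["status"] in ("SUCCESS", "SUCCEEDED")
                and overlapped):
            win[a1["node"]] += 1
            lose[a0["node"]] += 1
    # A Counter returns 0 for nodes that never won
    return sorted(n for n in lose
                  if lose[n] > min_losses and lose[n] > win[n] * ratio)

def make_pair(loser, winner):
    """Build one made-up task where `loser` was killed and `winner` succeeded."""
    return {
        "attempt0": {"node": loser, "status": "KILLED",
                     "start_time": 0, "finish_time": 20},
        "attempt1": {"node": winner, "status": "SUCCEEDED",
                     "start_time": 10, "finish_time": 15},
    }

# Made-up data: NodeA loses 60 races and wins 3; NodeC loses only 3.
pairs = [make_pair("NodeA", "NodeB") for _ in range(60)]
pairs += [make_pair("NodeC", "NodeA") for _ in range(3)]
slow_nodes = find_slow_nodes(pairs)
print(slow_nodes)
```

Counting wins as well as losses keeps busy but healthy nodes off the list, which matches the comment in the Pig script about false positives.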
Result
Extremely slow nodes came up
› Losing over 50 times and winning 0 or 1 time.
› Report to ops if this happens 2 days in a row
With a mixed config & hardware cluster
› Showed a trend of one type of node winning over others
Misconfigured Nodes
› Tasks repeatedly fail for some users & jobs
Like termites
When users notice, it's too late
Finding misconfigured nodes
Modify previous slow node detection script
For any task with
attempt0 "FAILED" (was "KILLED") and attempt1 "SUCCESS"
&& attempt0's finish time > attempt1's start time
Aggregate per node
fail count > 30 per day.
Add first 4 attempt IDs and error messages.
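A minimal Python sketch of the modified rule follows. It assumes the modification means "count first attempts that FAILED and whose retry succeeded", and it leaves out the finish-time comparison; both are assumptions, as are the input layout (`attempt_id`, `node`, `status`, `error` per attempt) and the name `find_failing_nodes`.

```python
from collections import defaultdict

def find_failing_nodes(task_pairs, min_failures=30, keep=4):
    """Group failed first attempts by node and keep a few samples per node.

    task_pairs is a hypothetical layout: one dict per task with the first
    two attempts under the keys "attempt0" and "attempt1".
    """
    failures = defaultdict(list)
    for task in task_pairs:
        a0, a1 = task["attempt0"], task["attempt1"]
        if a0["status"] == "FAILED" and a1["status"] in ("SUCCESS", "SUCCEEDED"):
            failures[a0["node"]].append((a0["attempt_id"], a0["error"]))
    return {
        node: {"fail_count": len(items), "samples": items[:keep]}
        for node, items in failures.items()
        if len(items) > min_failures
    }

def make_failure(index, bad_node, good_node):
    """Build one made-up task that failed on bad_node and succeeded on good_node."""
    return {
        "attempt0": {"attempt_id": "a0_%d" % index, "node": bad_node,
                     "status": "FAILED", "error": "example error"},
        "attempt1": {"attempt_id": "a1_%d" % index, "node": good_node,
                     "status": "SUCCEEDED", "error": ""},
    }

# Made-up data: NodeM fails 35 first attempts in a day, NodeP only 2.
pairs = [make_failure(i, "NodeM", "NodeN") for i in range(35)]
pairs += [make_failure(i, "NodeP", "NodeN") for i in range(35, 37)]
report = find_failing_nodes(pairs)
for node, info in report.items():
    print(node, info["fail_count"], info["samples"])
```

Keeping the first few attempt IDs and error messages next to the count gives whoever reads the report a concrete job to look at.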
Results
› Actively finding issues (but still manual)
1. Misconfigured nodes with job/error references
2. Detect regression in OS rolling upgrades
› Users' code failing
› Some OS specific errors (disk/user-lookup/etc)
3. Detect partial network failures/slowness
› Pair of nodes failing with Map fetch failures
› Nodes failing on localizations
4. Detect Hadoop bugs
› Like Disk-Fail-In-Place bug with dist cache
When cluster is bottlenecked on CPU…
In 0.23 + CapacityScheduler
Scheduling is based on memory
Memory limit is enforced by NodeManager, but CPU is not
Less important after
Hadoop 2.x + CPU-based/aware scheduling
Job & Task Counters
JobHistory Log
For each task attempt
{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":826430}
{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed (ms)","value":38863},
Find possible jobs wasting CPU
For each task attempt
› CPU_TIME / attempt_time (0 ~ 20) [CPU_RATIO]
› GC_TIME / attempt_time (0 ~ 1.0) [GC_RATIO]
Aggregate per job and show
› MAX_CPU_RATIO, MAX_GC_RATIO,
AVG_CPU_RATIO, AVG_GC_RATIO
Also, collecting percentage per day per job
› resources(Mbytes)%
› CPU_TIME%
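The per-attempt ratios and the per-job aggregation can be sketched as below. The field names `cpu_ms`, `gc_ms` and `elapsed_ms` are hypothetical stand-ins for the CPU_MILLISECONDS counter, the GC_TIME_MILLIS counter and the attempt's wall-clock time, and the numbers are made up.

```python
def job_ratios(attempts):
    """Aggregate CPU and GC ratios over the task attempts of one job.

    Each attempt is a dict with hypothetical keys cpu_ms, gc_ms and
    elapsed_ms (wall-clock duration of the attempt).
    """
    timed = [a for a in attempts if a["elapsed_ms"] > 0]
    if not timed:
        return None
    cpu = [a["cpu_ms"] / a["elapsed_ms"] for a in timed]
    gc = [a["gc_ms"] / a["elapsed_ms"] for a in timed]
    return {
        "MAX_CPU_RATIO": max(cpu),
        "AVG_CPU_RATIO": sum(cpu) / len(cpu),
        "MAX_GC_RATIO": max(gc),
        "AVG_GC_RATIO": sum(gc) / len(gc),
    }

# Made-up attempts: the first one burns two cores' worth of CPU time.
attempts = [
    {"cpu_ms": 200000, "gc_ms": 50000, "elapsed_ms": 100000},
    {"cpu_ms": 50000, "gc_ms": 10000, "elapsed_ms": 100000},
]
ratios = job_ratios(attempts)
print(ratios)
```

A CPU ratio well above 1.0 suggests a task using more than one core, and a GC ratio close to 1.0 suggests a task spending most of its time in garbage collection.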
Results
Able to reach out to users wasting CPU
› Job having a task taking 10-20 times of CPU time
› Job using __% of resources but __% of CPU time
For one job with 85% GC time, 3 times speedup
with ParallelGC → UseSerialGC (50% GC time)
Another +25% with G1GC but with more CPU time.
Limited HDFS Space
› HDFS quota has significantly reduced the number of abuse cases
› But still seeing an "HDFS almost full! Please delete" broadcast email once in a while.
Space to look for
1. Large directory that hasn’t changed
2. Large directory that suddenly increased
3. Large directory that hasn’t been accessed
4. Large directory not compressed
Space to look for
1. Large directory that hasn't changed
2. Large directory that suddenly increased
Save the following result daily
hdfs dfs -count /user/* /projects/*/*
→ Take a diff from __ days back.
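Comparing two saved snapshots could look like the sketch below. It relies on the `hdfs dfs -count` output columns being directory count, file count, content size and path; the size threshold, the growth factor and the function names are assumptions chosen for the example.

```python
def parse_count(text):
    """Map path -> content size in bytes from saved `hdfs dfs -count` output."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split(None, 3)  # DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
        if len(parts) == 4 and parts[2].isdigit():
            sizes[parts[3]] = int(parts[2])
    return sizes

def diff_usage(old_text, new_text, min_size, growth=2.0):
    """Return (unchanged, grown) lists of large directories between snapshots."""
    old, new = parse_count(old_text), parse_count(new_text)
    unchanged, grown = [], []
    for path, size in new.items():
        if size < min_size or path not in old:
            continue
        if size == old[path]:
            unchanged.append(path)
        elif old[path] > 0 and size >= old[path] * growth:
            grown.append(path)
    return sorted(unchanged), sorted(grown)

# Made-up snapshots taken some days apart.
old_snapshot = """\
   10   200   5000000 /user/alice
    3    40   1000000 /user/bob
    1     1        10 /user/carol
"""
new_snapshot = """\
   10   200   5000000 /user/alice
    5    90   4000000 /user/bob
    1     1        10 /user/carol
"""
unchanged, grown = diff_usage(old_snapshot, new_snapshot, min_size=1000000)
print("unchanged:", unchanged)
print("grown:", grown)
```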
4. Large directory not compressed
Too big to search and read the entire hdfs.
Need to cut down on search space
Interested in data created daily/hourly/etc
4. Large directory not compressed
listdir = (/)
while (listdir not empty)
  dir = listdir.pop
  if (dir.size() < 5TBytes) { skip/continue }
  if (dir has #subdirs with timestamp > 7) {
    pick one large file from recent timestamp subdir
    hcat $file | head --bytes 10MB | gzip -c | wc --bytes
  } else {
    push all subdirs to listdir
  }
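The sampling step (read the head of one file, gzip it, compare sizes) can be sketched in Python as follows. This only covers the ratio calculation on bytes already read; fetching the sample from HDFS is left out, and the name `compression_ratio` and the test data are assumptions.

```python
import gzip
import os

SAMPLE_BYTES = 10 * 1000 * 1000  # same 10MB cap as `head --bytes 10MB`

def compression_ratio(sample):
    """Return original size / gzip size for the first SAMPLE_BYTES of sample."""
    head = sample[:SAMPLE_BYTES]
    if not head:
        raise ValueError("empty sample")
    return len(head) / len(gzip.compress(head))

# Plain, repetitive text compresses well: a high ratio means the
# directory is probably stored uncompressed.
text_ratio = compression_ratio(b"2014-05-14 05:00:00 INFO some log line\n" * 10000)

# Random bytes stand in for data that is already compressed: ratio near 1.
random_ratio = compression_ratio(os.urandom(100000))

print(round(text_ratio, 1), round(random_ratio, 3))
```

Sampling only the head of one recent file keeps the scan cheap, at the cost of occasionally misjudging a directory whose files vary a lot.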
4. Large directory not compressed
DIRNAME: /projects/DDD/d1/d2/d3/d4
DIRSIZE: 77,912,005,675,237 (~70TB)
CLUSTER: mycluster-tan
Username: ddd_aa
Compression Ratio: 12.6718
Sample File: /projects/DDD/d1/d2/d3/d4/2014051405/part-m-00000
Sample Filesize: 134,217,852
Takes a couple of hours per cluster with a sequential script
Results
By periodically collecting the hdfs usage and
compression state
Identify stale dirs
Identify suddenly increasing dirs
Identify not compressed dirs
Why did my tiny job take hours yesterday?
Bug in users’ code
Queue full ?
Cluster full ?
If queue/cluster resource issue, what
changed recently?
Needed a way to look back
› Periodically save the output of
% mapred job -list
…
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1400781790269_206630 RUNNING 1400867814129 user1 queue1 NORMAL 2 0 3072M 0M 3072M mycluster.___.com:8088/proxy/application_1400781790269_206630/
…
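Once snapshots are saved, summing memory per queue is one way to look back at contention. The sketch below assumes the column order shown in the header above (Queue is the 5th field, UsedMem the 9th, with an `M` suffix); the function name, the second job row and its host are made up for the example.

```python
from collections import defaultdict

def used_mem_by_queue(snapshot):
    """Sum UsedMem (in MB) per queue from one saved `mapred job -list` output."""
    totals = defaultdict(int)
    for line in snapshot.splitlines():
        fields = line.split()
        # Only job rows: skip the header and any other lines
        if len(fields) < 11 or not fields[0].startswith("job_"):
            continue
        queue, used = fields[4], fields[8]
        if used.endswith("M") and used[:-1].isdigit():
            totals[queue] += int(used[:-1])
    return dict(totals)

# Made-up snapshot: the first row follows the slide, the second is invented.
snapshot = """\
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1400781790269_206630 RUNNING 1400867814129 user1 queue1 NORMAL 2 0 3072M 0M 3072M example-host:8088/proxy/a/
job_1400781790269_206631 RUNNING 1400867814200 user2 queue1 NORMAL 1 0 1024M 0M 1024M example-host:8088/proxy/b/
"""
usage = used_mem_by_queue(snapshot)
print(usage)
```

Running this over each saved snapshot gives a per-queue time series that can be compared against the time a slow job was submitted.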
Needed a way to look back (2)
Results
Users can look back and see if the jobs
hung due to queue/cluster contention.
Saving 'mapred job -list' outputs lets me go
back and check the individual jobs
What’s covered
Identifying…
Slow nodes
Misconfigured nodes
CPU wasting jobs
HDFS wasting users
Queue congestions
Thank You
@kojinoguchi
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.

More Related Content

What's hot

Finding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLFinding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLOlivier Doucet
 
Don't dump thread dumps
Don't dump thread dumpsDon't dump thread dumps
Don't dump thread dumpsTier1app
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?SegFaultConf
 
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in MicroservicesLife Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in MicroservicesSean Chittenden
 
了解Oracle rac brain split resolution
了解Oracle rac brain split resolution了解Oracle rac brain split resolution
了解Oracle rac brain split resolutionmaclean liu
 
Scaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersScaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersJonathan Levin
 
Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)Geoffrey De Smet
 
Cassandra EU - Data model on fire
Cassandra EU - Data model on fireCassandra EU - Data model on fire
Cassandra EU - Data model on firePatrick McFadin
 
Percona Live '18 Tutorial: The Accidental DBA
Percona Live '18 Tutorial: The Accidental DBAPercona Live '18 Tutorial: The Accidental DBA
Percona Live '18 Tutorial: The Accidental DBAJenni Snyder
 
The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210Mahmoud Samir Fayed
 
The Ring programming language version 1.10 book - Part 208 of 212
The Ring programming language version 1.10 book - Part 208 of 212The Ring programming language version 1.10 book - Part 208 of 212
The Ring programming language version 1.10 book - Part 208 of 212Mahmoud Samir Fayed
 
Distributed systems at ok.ru #rigadevday
Distributed systems at ok.ru #rigadevdayDistributed systems at ok.ru #rigadevday
Distributed systems at ok.ru #rigadevdayodnoklassniki.ru
 
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop ConnectorMongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop ConnectorMongoDB
 
PXC (Xtradb) Failure and Recovery
PXC (Xtradb) Failure and RecoveryPXC (Xtradb) Failure and Recovery
PXC (Xtradb) Failure and RecoveryAlkin Tezuysal
 
Become a GC Hero
Become a GC HeroBecome a GC Hero
Become a GC HeroTier1app
 
Puppetconf2016 Puppet on Windows
Puppetconf2016 Puppet on WindowsPuppetconf2016 Puppet on Windows
Puppetconf2016 Puppet on WindowsNicolas Corrarello
 
Molecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New WorldMolecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New WorldCan Ozdoruk
 

What's hot (19)

Finding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLFinding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQL
 
Don't dump thread dumps
Don't dump thread dumpsDon't dump thread dumps
Don't dump thread dumps
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
 
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in MicroservicesLife Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
Life Cycle of Metrics, Alerting, and Performance Monitoring in Microservices
 
了解Oracle rac brain split resolution
了解Oracle rac brain split resolution了解Oracle rac brain split resolution
了解Oracle rac brain split resolution
 
Scaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersScaling MySQL Strategies for Developers
Scaling MySQL Strategies for Developers
 
Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)Developing applications with rules, workflow and event processing (it@cork 2010)
Developing applications with rules, workflow and event processing (it@cork 2010)
 
Cassandra EU - Data model on fire
Cassandra EU - Data model on fireCassandra EU - Data model on fire
Cassandra EU - Data model on fire
 
Percona Live '18 Tutorial: The Accidental DBA
Percona Live '18 Tutorial: The Accidental DBAPercona Live '18 Tutorial: The Accidental DBA
Percona Live '18 Tutorial: The Accidental DBA
 
The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210The Ring programming language version 1.9 book - Part 57 of 210
The Ring programming language version 1.9 book - Part 57 of 210
 
The Ring programming language version 1.10 book - Part 208 of 212
The Ring programming language version 1.10 book - Part 208 of 212The Ring programming language version 1.10 book - Part 208 of 212
The Ring programming language version 1.10 book - Part 208 of 212
 
Distributed systems at ok.ru #rigadevday
Distributed systems at ok.ru #rigadevdayDistributed systems at ok.ru #rigadevday
Distributed systems at ok.ru #rigadevday
 
Universal Userland
Universal UserlandUniversal Userland
Universal Userland
 
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop ConnectorMongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
 
PXC (Xtradb) Failure and Recovery
PXC (Xtradb) Failure and RecoveryPXC (Xtradb) Failure and Recovery
PXC (Xtradb) Failure and Recovery
 
Leak, lock and a long pause
Leak, lock and a long pauseLeak, lock and a long pause
Leak, lock and a long pause
 
Become a GC Hero
Become a GC HeroBecome a GC Hero
Become a GC Hero
 
Puppetconf2016 Puppet on Windows
Puppetconf2016 Puppet on WindowsPuppetconf2016 Puppet on Windows
Puppetconf2016 Puppet on Windows
 
Molecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New WorldMolecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New World
 

Viewers also liked

Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Cloudera, Inc.
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 

Viewers also liked (6)

Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 

Similar to Collection of Small Tips on Further Stabilizing your Hadoop Cluster

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...
Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...
Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...Redis Labs
 
Puppetcamp Melbourne - puppetdb
Puppetcamp Melbourne - puppetdbPuppetcamp Melbourne - puppetdb
Puppetcamp Melbourne - puppetdbm_richardson
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet
 
DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...
DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...
DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...Felipe Prado
 
Troubleshooting .net core on linux
Troubleshooting .net core on linuxTroubleshooting .net core on linux
Troubleshooting .net core on linuxPavel Klimiankou
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB Puppet
 
Queue in the cloud with mongo db
Queue in the cloud with mongo dbQueue in the cloud with mongo db
Queue in the cloud with mongo dbNuri Halperin
 
MongoDB as a Cloud Queue
MongoDB as a Cloud QueueMongoDB as a Cloud Queue
MongoDB as a Cloud QueueMongoDB
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
MySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceMySQL Cluster Performance Tuning - 2013 MySQL User Conference
MySQL Cluster Performance Tuning - 2013 MySQL User ConferenceSeveralnines
 
Easy MySQL Replication Setup and Troubleshooting
Easy MySQL Replication Setup and TroubleshootingEasy MySQL Replication Setup and Troubleshooting
Easy MySQL Replication Setup and TroubleshootingBob Burgess
 
Solaris DTrace, An Introduction
Solaris DTrace, An IntroductionSolaris DTrace, An Introduction
Solaris DTrace, An Introductionsatyajit_t
 
Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Tal Bar-Zvi
 
Preventing and Resolving MySQL Downtime
Preventing and Resolving MySQL DowntimePreventing and Resolving MySQL Downtime
Preventing and Resolving MySQL DowntimeJervin Real
 
Ceph Day SF 2015 - Big Data Applications and Tuning in Ceph
Ceph Day SF 2015 - Big Data Applications and Tuning in Ceph Ceph Day SF 2015 - Big Data Applications and Tuning in Ceph
Ceph Day SF 2015 - Big Data Applications and Tuning in Ceph Ceph Community
 
Zend Framework 1 + Doctrine 2
Zend Framework 1 + Doctrine 2Zend Framework 1 + Doctrine 2
Zend Framework 1 + Doctrine 2Ralph Schindler
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 

Similar to Collection of Small Tips on Further Stabilizing your Hadoop Cluster (20)

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...
Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...
Reinforcement Learning On Hundreds Of Thousands Of Cores: Henrique Pondedeoli...
 
Puppetcamp Melbourne - puppetdb
Puppetcamp Melbourne - puppetdbPuppetcamp Melbourne - puppetdb
Puppetcamp Melbourne - puppetdb
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDBPuppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
 
DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...
DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...
DEF CON 27 - workshop - GUILLAUME ROSS - defending environments and hunting m...
 
Troubleshooting .net core on linux
Troubleshooting .net core on linuxTroubleshooting .net core on linux
Troubleshooting .net core on linux
 
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
Puppet Camp Melbourne 2014: Node Collaboration with PuppetDB
 
Queue in the cloud with mongo db
Queue in the cloud with mongo dbQueue in the cloud with mongo db
Queue in the cloud with mongo db
 
MongoDB as a Cloud Queue
MongoDB as a Cloud QueueMongoDB as a Cloud Queue
MongoDB as a Cloud Queue
 
Into The Box 2020 Keynote Day 1
Collection of Small Tips on Further Stabilizing your Hadoop Cluster

  • 1. Collection of small tips on further stabilizing your hadoop cluster. Presented by Koji Noguchi | June 3, 2014. 2014 Hadoop Summit, San Jose, California
  • 2. Agenda: Who I am; What’s NOT covered; List of tips that I found useful; Q&A
  • 3. Who I am: Grid Support/Solutions at Yahoo. › Helping users on the internal hadoop clusters (diagram: USER, OPS, Dev)
  • 4. Who I am: Grid Support/Solutions at Yahoo. › Helping users on the internal hadoop clusters • Covering everything!? (diagram: USER, OPS, Dev)
  • 5. Who I am: Grid Support/Solutions at Yahoo. › Helping users on the internal hadoop clusters • Covering everything!? Covering any tiny pieces not picked up by others … (diagram: USER, OPS, Dev)
  • 6. What’s NOT covered in this talk: How to maintain the clusters (ops) › automation of break-fixing, upgrading, monitoring. How to configure hadoop clusters (dev) › Healthcheck script, reserving disk space, number of slots per node, etc. How to tune your hadoop jobs (user) › Less spilling and identifying bottlenecks.
  • 7. Some items that turned out to be useful: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
  • 8. Some items that turned out to be useful: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
  • 9. Slow nodes hurting Hadoop cluster? One of Hadoop’s strengths: FAULT TOLERANCE
  • 10. Slow nodes hurting Hadoop cluster? One of Hadoop’s strengths: FAULT TOLERANCE. Complete failure → GREAT! Fast recovery. Partial failure → NOT so great … Tasks scheduled on slow nodes can take forever…
  • 11. Speculative Execution helps? Run a redundant copy of the slow task and take the output from whichever attempt finishes first. (diagram: HDFS (DATA); task1 attempt0 on NodeA; Output)
  • 12. Speculative Execution helps? Run a redundant copy of the slow task and take the output from whichever attempt finishes first. (diagram: HDFS (DATA); task1 attempt0 on NodeA; task1 attempt1 on NodeB; Output; take the output from the faster attempt)
  • 13. Speculative Execution helps? Run a redundant copy of the slow task and take the output from whichever attempt finishes first. (diagram: task1 attempt0 on NodeA; task1 attempt1 on NodeB; NodeX, NodeY, NodeZ; Output)
  • 14. Speculative Execution helps? Speculative Execution helps a lot, BUT … 1. Both nodes could be slow 2. Two attempts can still hit a slow datanode 3. Not all jobs can use speculative execution
  • 15. How it used to work: With 100s of users → at least 1 user who’s good at policing
  • 16. AUTOMATION (image: https://www.flickr.com/photos/antiuniverse/410462775)
  • 17. How it used to work
  • 18. How it used to work
  • 19. Identifying slow nodes: Comparing performance
  • 20. Identifying slow nodes: Comparing performance
  • 21. Identifying slow nodes: Comparing performance
  • 22. Identifying slow nodes: Speculative Execution. BEFORE: at runtime, users used it to work around hitting the slow nodes. HERE: a day (or hours) later, use the logs to identify the slow nodes.
  • 25. Speculative Execution Again (timeline diagram, TIME axis 0:00, 10:00, 20:00; nodes NodeA, NodeB, NodeX, NodeY): task1 attempt0 on NodeA is KILLED, task1 attempt1 on NodeB produces the Output. LOSE['NodeA'] += 1; WIN['NodeB'] += 1
  • 26. JobHistory Log:
        {"type":"MAP_ATTEMPT_STARTED","event":... "attemptId":"attempt_1399615563645_308371_m_000000_0","startTime":1400522576570,"trackerName":"gsbl31747.blue.ygrid.yahoo.com",…
        {"type":"MAP_ATTEMPT_FINISHED","event":... "taskType":"MAP","taskStatus":"SUCCEEDED","mapFinishTime":1400522582385
  • 27. Pig Job analyzing the job history:
        A = LOAD 'starling.starling_task_attempts' USING org.apache.hcatalog.pig.HCatLoader();
        B = FILTER A by dt >= '$STARTDATE';
        describe B;
        C = FOREACH B generate grid, dt, task_id, task_attempt_id, type, host_name, status, start_ts, shuffle_time, sort_time, finish_time;
        D = FILTER C by type == 'MAP' or type == 'REDUCE';
        ATTEMPT0 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_0') == (SIZE(task_attempt_id) - 2);
        ATTEMPT1 = FILTER D by LAST_INDEX_OF(task_attempt_id, '_1') == (SIZE(task_attempt_id) - 2);

        -- This would filter out any task that had only 1 attempt
        TaskWithAtLeastTwoAttempts = join ATTEMPT0 by task_id, ATTEMPT1 by task_id;

        -- For simplicity, I am only looking at tasks whose second attempt was successful
        TaskWith2ndAttemptSuccess = filter TaskWithAtLeastTwoAttempts by
            (ATTEMPT1::status == 'SUCCESS' or ATTEMPT1::status == 'SUCCEEDED')
            and ATTEMPT0::status == 'KILLED'
            and ATTEMPT0::start_ts + ATTEMPT0::shuffle_time + ATTEMPT0::sort_time + ATTEMPT0::finish_time > ATTEMPT1::start_ts;

        -- Counting number of first attempt fail and kill events for each node
        FirstFailedKilledAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT0::grid, ATTEMPT0::host_name, ATTEMPT0::dt;
        FirstFailedKilledAttempt2 = GROUP FirstFailedKilledAttempt by (grid, host_name, dt);
        FirstFailedKilledAttempt3 = FOREACH FirstFailedKilledAttempt2 {
            generate group.grid, group.host_name, group.dt, COUNT(FirstFailedKilledAttempt) as firstFailedCounts;
        }

        -- Only counting the number of failures gave too many false positives. Also counting how many times each node won.
        SecondSuccessAttempt = FOREACH TaskWith2ndAttemptSuccess generate ATTEMPT1::grid, ATTEMPT1::host_name, ATTEMPT1::dt;
        SecondSuccessAttempt2 = GROUP SecondSuccessAttempt by (grid, host_name, dt);
        SecondSuccessAttempt3 = FOREACH SecondSuccessAttempt2 generate group.grid, group.host_name, group.dt, COUNT(SecondSuccessAttempt) as secondSuccessfulCounts;

        GridNodeSuccessFailedCounts = join FirstFailedKilledAttempt3 by (grid, host_name, dt) left outer, SecondSuccessAttempt3 by (grid, host_name, dt);
        GridNodeSuccessFailedCounts2 = FILTER GridNodeSuccessFailedCounts by firstFailedCounts > 50
            and firstFailedCounts > (secondSuccessfulCounts is null ? 0 : secondSuccessfulCounts) * 4;
  • 28. Pig Job analyzing the job history: For any task with attempt0 “KILLED” and attempt1 “SUCCESS” && attempt0’s finish time > attempt1’s start time: WIN[attempt1’s node] += 1; LOSE[attempt0’s node] += 1. Aggregate and print out any nodes with LOSE[‘node’] > WIN[‘node’] * 4 && LOSE[‘node’] > 50
  • 29. Result: Extremely slow nodes came up with › Losing over 50 times and winning 0 or 1 time. › Report to ops if this happens 2 days in a row. With a mixed config & hardware cluster › Showed a trend of one type of node winning over others
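The WIN/LOSE rule on slide 28 can be sketched outside Pig as well. The following Python is a minimal illustration, not the script used in the talk; the record layout (`attempt0`, `attempt1`, `node`, `status`, `start`, `finish`) is a hypothetical stand-in for the job history table, while the thresholds (more than 50 losses, losses above 4 times the wins) come from the slide.

```python
from collections import Counter


def find_slow_nodes(tasks, min_losses=50, ratio=4):
    """Return nodes that lose speculative races far more often than they win.

    `tasks` is an iterable of dicts with hypothetical keys "attempt0" and
    "attempt1"; each attempt is a dict with "node", "status", "start", "finish".
    """
    win, lose = Counter(), Counter()
    for task in tasks:
        first, second = task["attempt0"], task["attempt1"]
        # Only count real races: attempt0 was still running when attempt1 started.
        overlapped = first["finish"] > second["start"]
        second_ok = second["status"] in ("SUCCESS", "SUCCEEDED")
        if first["status"] == "KILLED" and second_ok and overlapped:
            win[second["node"]] += 1
            lose[first["node"]] += 1
    return sorted(
        node for node in lose
        if lose[node] > min_losses and lose[node] > win[node] * ratio
    )
```

Counting wins alongside losses follows the comment in the Pig script: loss counts alone produced too many false positives.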
  • 31. Some items that turned out to be useful: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
  • 32. Misconfigured Nodes: Misconfigured nodes → tasks repeatedly fail for some users and jobs. Like termites: when users notice, it’s too late.
  • 33. Finding misconfigured nodes: Modify the previous slow node detection script: For any task with attempt0 “KILLED” and attempt1 “SUCCESS” && attempt0’s finish time > attempt1’s start time
  • 34. Finding misconfigured nodes: Modify the previous slow node detection script: For any task with attempt0 “KILLED” (changed to “FAILED”) and attempt1 “SUCCESS” && attempt0’s finish time > attempt1’s start time. Aggregate per node: fail count > 30 per day. Add the first 4 attempt IDs and error messages.
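The modified check on slide 34 can be sketched as below. This is an assumption-laden illustration rather than the talk's script: the field names (`node`, `day`, `status`, `attempt_id`, `error`) are hypothetical, and it simply counts FAILED attempts per node per day, keeping the first few attempt IDs and error messages as evidence for whoever follows up.

```python
from collections import defaultdict


def find_failing_nodes(attempts, threshold=30, samples=4):
    """Group FAILED attempts by (node, day) and flag groups above `threshold`.

    `attempts` is an iterable of dicts with hypothetical keys
    "node", "day", "status", "attempt_id" and "error".
    """
    failures = defaultdict(list)
    for attempt in attempts:
        if attempt["status"] == "FAILED":
            key = (attempt["node"], attempt["day"])
            failures[key].append((attempt["attempt_id"], attempt["error"]))
    return {
        key: {"count": len(items), "samples": items[:samples]}
        for key, items in failures.items()
        if len(items) > threshold
    }
```

Attaching sample attempt IDs and errors to each flagged node mirrors the slide's "first 4 attempts ID and error messages" so the report points directly at reproducible cases.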
  • 35. Results → Actively finding issues (but still manual): 1. Misconfigured nodes with job/error references 2. Detect regressions in OS rolling upgrades › Users’ code failing › Some OS-specific errors (disk/user-lookup/etc) 3. Detect partial network failures/slowness › Pair of nodes failing with Map fetch failures › Nodes failing on localizations 4. Detect Hadoop bugs › Like the Disk-Fail-In-Place bug with dist cache
  • 36. Some items that turned out to be useful: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
  • 37. When cluster is bottlenecked on CPU… In 0.23 + CapacityScheduler, scheduling is based on memory. The memory limit is enforced by the NodeManager, but CPU is not. Less important after Hadoop 2.X + CPU-based/aware scheduling.
  • 38. Job & Task Counters
  • 39. JobHistory Log: For each task attempt: {"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":826430} {"name":"GC_TIME_MILLIS","displayName":"GC time elapsed (ms)","value":38863},
  • 40. Find possible jobs wasting CPU: For each task attempt › CPU_TIME / attempt_time (0 ~ 20) [CPU_RATIO] › GC_TIME / attempt_time (0 ~ 1.0) [GC_RATIO]. Aggregate per job and show › MAX_CPU_RATIO, MAX_GC_RATIO, AVG_CPU_RATIO, AVG_GC_RATIO. Also, collecting percentage per day per job › resources (MBytes) % › CPU_TIME %
  • 41. Results: Able to reach out to users wasting CPU › Job having a task taking 10-20 times of cpu time › Job using __% of resources but __% of cpu time. For one job with 85% gc time, 3 times speedup with ParallelGC → UseSerialGC (50% gc time). Another +25% with G1GC but with more CPU time.
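The per-attempt ratios and per-job aggregates on slide 40 amount to a few divisions and a group-by. A minimal sketch, assuming each task attempt has already been flattened into a dict with hypothetical keys `job_id`, `start_ms`, `finish_ms`, `cpu_ms` (from the CPU_MILLISECONDS counter) and `gc_ms` (from GC_TIME_MILLIS):

```python
from collections import defaultdict


def job_cpu_gc_ratios(attempts):
    """Compute MAX/AVG CPU_RATIO and GC_RATIO per job.

    CPU_RATIO = cpu_ms / elapsed and GC_RATIO = gc_ms / elapsed, where
    elapsed is the attempt's wall-clock time in milliseconds.
    """
    per_job = defaultdict(list)
    for attempt in attempts:
        elapsed = attempt["finish_ms"] - attempt["start_ms"]
        if elapsed <= 0:
            continue  # skip attempts with missing or inconsistent timestamps
        per_job[attempt["job_id"]].append(
            (attempt["cpu_ms"] / elapsed, attempt["gc_ms"] / elapsed)
        )
    summary = {}
    for job_id, ratios in per_job.items():
        cpu = [c for c, _ in ratios]
        gc = [g for _, g in ratios]
        summary[job_id] = {
            "max_cpu_ratio": max(cpu),
            "avg_cpu_ratio": sum(cpu) / len(cpu),
            "max_gc_ratio": max(gc),
            "avg_gc_ratio": sum(gc) / len(gc),
        }
    return summary
```

A CPU_RATIO well above 1 means a single task keeps several cores busy, which matters on a cluster where the scheduler only accounts for memory.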
  • 42. Some items that turned out to be useful: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
  • 43. Limited HDFS Space: HDFS Quota has significantly reduced the number of abuse cases. But we still see an “HDFS almost full! Please delete” broadcast email once in a while.
  • 44. Space to look for: 1. Large directory that hasn’t changed 2. Large directory that suddenly increased 3. Large directory that hasn’t been accessed 4. Large directory not compressed
  • 45. Space to look for: 1. Large directory that hasn’t changed 2. Large directory that suddenly increased 3. Large directory that hasn’t been accessed 4. Large directory not compressed
  • 46. Space to look for: 1. Large directory that hasn’t changed 2. Large directory that suddenly increased. Save the following result daily: hdfs dfs -count /user/* /projects/*/* → Take a diff from __ days back.
  • 47. 4. Large directory not compressed: Too big to search and read the entire hdfs. Need to cut down on the search space. Interested in data created daily/hourly/etc.
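Taking the diff between two saved `hdfs dfs -count` snapshots can be sketched as follows. The parser relies on the documented output columns of `-count` (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME); the function names and the `min_size` cutoff are illustrative assumptions, and "unchanged size" is used here only as a rough proxy for "hasn't changed".

```python
def parse_count_output(text):
    """Map PATHNAME -> CONTENT_SIZE (bytes) from saved `hdfs dfs -count` output."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split(None, 3)  # DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
        if len(parts) == 4 and parts[2].isdigit():
            sizes[parts[3].strip()] = int(parts[2])
    return sizes


def usage_diff(old_text, new_text, min_size):
    """Return (unchanged large dirs, dirs that grew by at least min_size bytes)."""
    old = parse_count_output(old_text)
    new = parse_count_output(new_text)
    unchanged = sorted(
        path for path, size in new.items()
        if size >= min_size and old.get(path) == size
    )
    grown = sorted(
        ((path, size - old.get(path, 0)) for path, size in new.items()
         if size - old.get(path, 0) >= min_size),
        key=lambda item: -item[1],
    )
    return unchanged, grown
```

Comparing against a snapshot from several days back, rather than the previous day, separates steady daily growth from directories that are simply stale.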
  • 48. 4. Large directory not compressed:
        listdir = (/)
        while (listdir not empty)
            dir = listdir.pop
            if (dir.size() < 5TBytes) { skip/continue }
            if (dir has #subdirs with timestamp > 7) {
                pick one large file from recent timestamp subdir
                hcat $file | head --bytes 10MB | gzip -c | wc --bytes
            } else {
                push all subdirs to listdir
            }
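The sampling step in the pseudocode above (read the head of one file, gzip it, compare sizes) can be sketched in Python. This is an illustration only: it works on a byte string that the caller has already read from a sample file, and the threshold of 2.0 for calling data "not compressed" is an assumption, not a value from the talk.

```python
import gzip


def compression_ratio(sample: bytes) -> float:
    """Return original size divided by gzip-compressed size for a data sample."""
    if not sample:
        raise ValueError("sample must not be empty")
    return len(sample) / len(gzip.compress(sample))


def looks_uncompressed(sample: bytes, threshold: float = 2.0) -> bool:
    """Guess that the data is stored uncompressed if gzip shrinks it a lot."""
    return compression_ratio(sample) > threshold
```

Data that is already compressed barely shrinks again, so its ratio stays near 1, while plain text or repetitive records usually shrink several times over.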
  • 49. 4. Large directory not compressed:
        DIRNAME: /projects/DDD/d1/d2/d3/d4
        DIRSIZE: 77,912,005,675,237 (~70TB)
        CLUSTER: mycluster-tan
        Username: ddd_aa
        Compression Ratio: 12.6718
        Sample File: /projects/DDD/d1/d2/d3/d4/2014051405/part-m-00000
        Sample Filesize: 134,217,852
    Takes a couple of hours per cluster with a sequential script.
  • 50. Results: By periodically collecting the hdfs usage and compression state: Identify stale dirs; Identify suddenly increasing dirs; Identify uncompressed dirs
  • 51. Some items that turned out to be useful: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
  • 52. Why did my tiny job take hours yesterday? Bug in user’s code? Queue full? Cluster full? If it is a queue/cluster resource issue, what changed recently?
  • 53. Needed a way to look back: Periodically save the output of
        % mapred job -list
        …
        JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
        job_1400781790269_206630 RUNNING 1400867814129 user1 queue1 NORMAL 2 0 3072M 0M 3072M mycluster.___.com:8088/proxy/application_1400781790269_206630/
        …
  • 54. Needed a way to look back (2)
  • 55. Results: Users can look back and see if their jobs hung due to queue/cluster contention. Saving ‘mapred job -list’ outputs lets me go back and check the individual jobs.
  • 56. What’s covered: Identifying… Slow nodes; Misconfigured nodes; CPU wasting jobs; HDFS wasting users; Queue congestions
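Once snapshots of `mapred job -list` are saved, a small parser can turn each one into per-queue totals for looking back at congestion. The sketch below assumes the whitespace-separated column order shown on the slide (JobId, State, StartTime, UserName, Queue, Priority, UsedContainers, RsvdContainers, UsedMem, ...) and that UsedMem carries an `M` suffix; the function name is hypothetical.

```python
from collections import Counter


def used_mem_by_queue(snapshot: str) -> Counter:
    """Sum the UsedMem column (in MB) per queue from one saved snapshot."""
    usage = Counter()
    for line in snapshot.splitlines():
        fields = line.split()
        # Data rows start with a job id; header and banner lines are skipped.
        if len(fields) >= 11 and fields[0].startswith("job_") and fields[8].endswith("M"):
            queue = fields[4]
            usage[queue] += int(fields[8][:-1])
    return usage
```

Running this over snapshots taken at regular intervals gives a per-queue time series, which is enough to tell whether a slow job coincided with a full queue.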
  • 57. Thank You @kojinoguchi We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.