Hadoop Cluster Management

Published in: Technology, Health & Medicine


  1. Hadoop Cluster Management. Dheeraj Kapur, Principal Engineer, Yahoo! (dheerajk@yahoo-inc.com)
  2. What it is…
     • Workflow-based system for cluster management.
     • Completely modular and distributed design.
     • Has its own JMX-based library (can be used to monitor other services on the cluster).
     • Fully controllable from the WebUI.
     • Has a command line utility for ad hoc administration.
  3. What it does…
     • Manages clusters.
     • Break fixing.
     • Upgrades the OS seamlessly.
     • Consistency/efficiency of clusters.
     • Proactive self-healing model.
     • User management.
  4. Manage Clusters
     • It has a well-defined workflow to manage clusters.
     • No or minimal human intervention required.
     • Keeps up the efficiency of the cluster.
     • Keeps track of missing/bad blocks on the system.
     • Well-defined WebUI and command line utility.
  5. System Overview
  6. Workflow
  7. Workflow contd…
  8. Command Line Utility
  9. Web Interface
  10. Web Interface contd…
  11. Fixing bad/poorly performing nodes
     These errors can lead to SLA misses or job failures.
     • Takes care of nodes blacklisted by the JobTracker (JT).
     • Handles errors like high load average and wrong network speed.
     • Parses system logs at X frequency (through workflows) and looks for patterns.
     • Visits each node multiple times a day and checks the health of the node.
  12. Upgrade OS
     • Upgrades and rolls back the OS seamlessly.
     • Upgrades run on heavily used production clusters.
  13. Consistency & efficiency of clusters
     • Keeps track of cluster MapReduce (MR) capacity.
     • Proactive fixing of sick nodes, which can cause potential issues.
  14. Introducing the proactive self-healing system
     Let me set the ground for it.
     • Wounded hosts, called Set A: hosts that have issues but are still in service (with degraded services), and which can cause potential SLA misses and job execution issues (as we have seen in the past).
     • Fractured hosts, called Set B: hosts already in the break-fix cycle and getting fixed.
     • All grid hosts, called Set X: all grid hosts (healthy and fine ones included).
     • Sets A and B are subsets of Set X.
     • To find wounded hosts we have to scan the entire infrastructure once a day.
     • Calculate the symmetric difference between Set A and Set B to get the actual wounded hosts that need service.
  15. Proactive self-healing contd… [Diagram labels: All Grid Hosts (X), Set A, Set B]
  16. Proactive self-healing contd…
  17. User Management
     • We have one of the most complex and secure environments.
     • User access and management is a complex task, due to the number of users, the security constraints and the complexity involved in provisioning access.
     • Provisioning a single request requires changes in multiple places.
     • Well-defined workflow-based system, where 100% automation is achieved.
     • Great help during system audits and compliance checks.
  18. Q&A
  19. Thank You
  20. Sessions will resume at 4:30pm
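
Slide 11 mentions checking nodes for high load average and wrong network speed, and parsing system logs for patterns. The slides do not show how this is implemented, so the following is only a minimal sketch of what one such check could look like. The pattern names, the regular expressions, the thresholds and the function names (`scan_log_lines`, `check_node`) are all hypothetical and are not taken from the talk.

```python
import re

# Hypothetical log patterns; the slides do not list the real ones.
LOG_PATTERNS = {
    "disk_error": re.compile(r"I/O error", re.IGNORECASE),
    "oom": re.compile(r"Out of memory", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Return the sorted names of all patterns that match at least one line."""
    hits = set()
    for line in lines:
        for name, pattern in LOG_PATTERNS.items():
            if pattern.search(line):
                hits.add(name)
    return sorted(hits)


def check_node(load_avg, link_speed_mbps, log_lines,
               max_load=20.0, expected_speed_mbps=1000):
    """Collect the reasons a node looks unhealthy; an empty list means healthy.

    The default thresholds are assumptions chosen for illustration only.
    """
    problems = []
    if load_avg > max_load:
        problems.append("high_load")
    if link_speed_mbps != expected_speed_mbps:
        problems.append("wrong_network_speed")
    problems.extend(scan_log_lines(log_lines))
    return problems


if __name__ == "__main__":
    sample_log = ["kernel: Buffer I/O error on device sda, logical block 0"]
    print(check_node(35.0, 100, sample_log))
```

A workflow could call something like `check_node` on every visit to a host and send any host with a non-empty result into the break-fix cycle.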

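The set arithmetic on slide 14 can be sketched in a few lines. The slide calls the operation a symmetric difference; strictly, the symmetric difference of A and B also contains hosts that are in B but not in A. The sketch below therefore uses the plain difference A − B, on the assumption that the goal is "wounded hosts that are not already being fixed". The function name and the host names are made up for illustration.

```python
def hosts_needing_service(wounded, in_break_fix, all_hosts):
    """Return wounded hosts (Set A) that are not already in break-fix (Set B).

    Raises ValueError if A or B contains a host outside the grid (Set X),
    since the slide states that both are subsets of X.
    """
    set_a = set(wounded)
    set_b = set(in_break_fix)
    set_x = set(all_hosts)
    if not (set_a <= set_x and set_b <= set_x):
        raise ValueError("Set A and Set B must be subsets of Set X")
    # Assumption: plain difference, not symmetric difference (see lead-in).
    return sorted(set_a - set_b)


if __name__ == "__main__":
    grid = ["h1", "h2", "h3", "h4", "h5"]      # Set X
    wounded = ["h1", "h2", "h3"]               # Set A
    in_break_fix = ["h3", "h4"]                # Set B
    print(hosts_needing_service(wounded, in_break_fix, grid))  # ['h1', 'h2']
    # For comparison, the symmetric difference also picks up h4:
    print(sorted(set(wounded) ^ set(in_break_fix)))  # ['h1', 'h2', 'h4']
```

If every host in Set B is also in Set A, the two operations give the same result; they differ only when a host is in break-fix without currently being flagged as wounded.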