Effective Hadoop Cluster Management - Impetus White Paper



WHITE PAPER

Effective Hadoop Cluster Management

Abstract

In this white paper, Impetus Technologies talks about Apache Hadoop™, an open source, Java-based software framework that enables the processing of huge amounts of data through distributed data processing. It explains how correct and effective provisioning and management is the key to a successful Hadoop cluster environment, and thus helps to make an individual's Hadoop working experience a pleasant one. The white paper also discusses the challenges associated with cluster setup, sharing, and management, and focuses on the benefits of automated setup, centralized management of multiple Hadoop clusters, and quick provisioning of cloud-based Hadoop clusters.

Impetus Technologies, Inc.
www.impetus.com
April 2012
Table of Contents

Introduction
Understanding Hadoop cluster related challenges
    Manual operation
    Cluster set up
    Cluster management
    Cluster sharing
    Hadoop compatibility and others
What is missing?
Solutions space
    Addressing operational challenges
    Addressing cluster set up challenges
    Addressing cluster management challenges
    Addressing cluster sharing challenges
    Addressing Hadoop compatibility related challenges
Can Hadoop Cluster Management tools help?
The Impetus solution
Summary
Introduction

The Hadoop framework offers the required support for data-intensive distributed applications. It manages and engages multiple nodes for distributed processing of large amounts of data stored locally on the individual nodes. The results produced by the individual nodes are then consolidated to generate the final output. Hadoop provides Map/Reduce APIs and works on Hadoop-compatible distributed file systems.

Hadoop sub-components and related tools, such as HBase, Hive, Pig, ZooKeeper, Mahout, etc., have specific uses and benefits associated with them. Normally, these sub-components are used along with Hadoop and therefore also need set up and configuration.

Setting up a standalone or pseudo-distributed cluster, or even a relatively small localized cluster, is an easy task. On the other hand, manually setting up and managing a production-level cluster in a truly distributed environment requires significant effort, particularly in the areas of cluster set up, configuration and management. It is also tedious, time consuming and repetitive in nature. Factors such as the Hadoop vendor, version, bundle type and target environment add to the existing cluster set up and management complexities. Different cluster modes also call for different kinds of configurations; commands and settings change with the cluster mode, increasing the challenges related to Hadoop set up and management.

Understanding Hadoop cluster related challenges

The challenges associated with Hadoop can be broadly classified into the following:

1. Operational
2. Cluster set up
3. Cluster management
4. Cluster sharing
5. Hadoop compatibility and others

Let us take them up one by one to understand what they mean.

Operational challenges

Operational challenges mainly arise due to factors like manual operation, a console-based, non-friendly interface, and interactive, serial execution.
Manual operation

The manual mode of execution requires a full-time, totally interactive user session and consumes a lot of time. It is also error-prone: mistakes and omissions can require the entire activity to begin again from scratch.

Interface

Another factor is the console-based interface, which is the primary and only available default interface for interacting with the Hadoop cluster. It is therefore, to some extent, also responsible for the serial execution of activities.

Cluster set up

In their simplest form, Hadoop bundles are simple tar files. They need to be extracted, set up and initialized. Apache Hadoop bundles (especially the tarball) do not have any set up support around them. The default way to set up the cluster is totally manual: a sequence of activities has to be followed depending on the cluster mode and Hadoop version/vendor. The cluster set up activity involves a lot of complexities and variations due to factors like the set up environment (on-premise versus the Cloud), cluster mode, component bundle type, vendor and version. On top of these complexities, the manual, interactive and attended mode of operation increases the challenges.

Cluster management

The current cluster management in Hadoop offers limited functionality, and the operations need to be carried out manually from a console-based interface. There is no feature that enables the management of multiple clusters from one single location. One needs to change the interface or log on to different machines in order to manage different clusters.

Cluster sharing

With the current way of operation, the task of sharing Hadoop clusters across various users and user groups with altogether different requirements is not just challenging, tedious and time-consuming, but to some extent also insecure.

Hadoop compatibility and others

The key factors that fall into this category relate to areas like Hadoop API compatibility, working with Hadoop bundles and bundle formats (tar/RPM) from multiple vendors, operational and command differences between Hadoop versions, etc.
What is missing?

After examining the challenges, it is important to understand what is missing. Once we know the missing dimensions, it is possible to overcome or address most of the challenges.

Missing dimensions:

• Operational support
    o Automation
    o Alternate, user-friendly interface
    o Monitoring and notifications support
• Setup support
• Cluster management support
• Cluster sharing mechanism

When we compare Hadoop with other Big Data solutions (like Greenplum, or commercial solutions such as Aster Data), we find that those solutions offer support around the above mentioned dimensions, which appears to be missing in Hadoop.

Today, there are tools in the market that address some or most of the challenges mentioned above. These solutions primarily build the missing dimensions around Hadoop and thereby address the various pain points. Let us now look at how these dimensions can help deal with the various challenges.

Solutions space

Addressing operational challenges

The operational challenges can be addressed using a combination of methods: automation, and an alternate user interface with support for updates and notifications.

Automation

Automation enables unattended, quick, and error-free execution of any activity. Smart automation can take care of various associated factors and situations in a context-aware manner. Automation ensures that the right commands are submitted, even though the parameters, keyed in by users, may or may not be correct. With an input-validating interface it is possible to validate user inputs so as to ensure that only the right parameters are being used.

Using an alternate interface

As discussed earlier, the default console-based interface brings several limitations, such as serial execution and interactive working. It is possible to overcome this problem by adopting a user-friendly GUI as an alternate interface that additionally supports configuration, input validation and automation, and at the same time runs several activities in parallel. An alternate, friendly user interface helps in accessing Hadoop functionality and operations in a streamlined manner.

Impetus strongly believes that the operational challenges associated with Hadoop clusters can be addressed to a great extent by using an alternate interface that supports automation and provides parallel working support. Thus automation and an alternate interface together offer an easier and better Hadoop working environment.

Addressing cluster set up challenges

Cluster set up activity requires careful execution of pre-defined actions in a situation-aware manner, where even a minor error or omission due to manual intervention can result in a major setback. While simple automation can handle this problem, some actions may still require user intervention (e.g. accepting license agreements). Bringing smart automation into the picture enables a quick set up of the cluster in a hassle-free, non-interactive manner. The entire cluster set up functionality can be offered through a configurable alternate user interface that offers simple, click-based cluster provisioning through a friendly and highly configurable UI.
This in turn utilizes context-aware automation based on the provided inputs, and can perform multiple activities in parallel.

Understanding the difference between setting up a cluster on-premise and over the Cloud

For Cloud-based clusters, organizations are required to launch and terminate the cluster instances. However, the hardware, operating systems, and installed Java versions are mostly uniform, which may not be the case in an on-premise deployment. For on-premise deployment, it is important to set up password-less SSH between the nodes, which is not required in the Cloud set up. The setup of the Hadoop ecosystem components remains the same, regardless of the cluster set up environment.
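The environment-dependent differences described above can be captured by a small, context-aware step planner. The sketch below is purely illustrative (the step names and the `plan_setup_steps` helper are hypothetical, not part of any real tool):

```python
def plan_setup_steps(environment: str, nodes: int) -> list:
    """Return an ordered list of setup steps for the given environment.

    Illustrative sketch: on-premise clusters need password-less SSH
    between nodes, while Cloud clusters need instance launch handling;
    the Hadoop component setup itself is identical in both cases.
    """
    if environment not in ("on-premise", "cloud"):
        raise ValueError("environment must be 'on-premise' or 'cloud'")

    steps = []
    if environment == "cloud":
        # Cloud: launch mostly uniform instances via the provider API.
        steps.append("launch %d instances" % nodes)
    else:
        # On-premise: machines already exist, but need key exchange.
        steps.append("set up password-less SSH between nodes")

    # Component setup is the same regardless of environment.
    steps += ["extract Hadoop bundle", "write configuration", "start services"]
    return steps

print(plan_setup_steps("cloud", 4))
print(plan_setup_steps("on-premise", 4))
```

A context-aware tool would branch in the same way, deriving the step list from user inputs instead of performing one fixed script.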
Provisioning the Cloud-based Hadoop cluster

The complexities of provisioning a Cloud-based Hadoop cluster arise primarily from manual operation, which involves steps such as accessing the Cloud provider's interface to launch the required number of nodes with the required hardware configurations, and providing inputs for key pairs, security settings, machine images, etc. The required ports need to be opened or unblocked, individual node IPs collected manually, and all these IP addresses/hostnames added to the Hadoop slaves file. After using the Cloud cluster, one again needs to manually terminate all the machines sequentially, by individually selecting them on the Cloud interface.

If the cluster size is small, all these activities can be carried out easily. However, performing them manually on a large-sized cluster is cumbersome and may lead to errors. One continuously needs to switch between the Cloud provider's interface and the Hadoop management interface.

Bringing automation into the picture can ease all these activities and help save time and effort. One can incorporate automation simply by adding scripts, by using Cloud provider-specific exposed APIs, or alternatively by using generic Cloud APIs such as jclouds, Simple Cloud, libcloud, Deltacloud, etc. Cloudera CDH-2 scripts can help launch instances on the Cloud and then enable setting up Hadoop over the launched nodes. This is somewhat similar to Whirr, which uses the jclouds API in the background.
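The manual IP-collection step above is a natural candidate for scripting. The sketch below is a simplified illustration (the `launched_nodes` records and the `build_slaves_file` helper are hypothetical; a real script would obtain the addresses from the Cloud provider's API, for example via libcloud):

```python
def build_slaves_file(nodes):
    """Build the contents of a Hadoop 'slaves' file from launched nodes.

    Each worker's address goes on its own line; the master is excluded.
    The node dictionaries stand in for a Cloud provider API response.
    """
    workers = [n for n in nodes if n["role"] == "worker"]
    return "\n".join(n["private_ip"] for n in workers) + "\n"

# Hypothetical result of launching a 4-node cluster through a Cloud API.
launched_nodes = [
    {"role": "master", "private_ip": "10.0.0.10"},
    {"role": "worker", "private_ip": "10.0.0.11"},
    {"role": "worker", "private_ip": "10.0.0.12"},
    {"role": "worker", "private_ip": "10.0.0.13"},
]

print(build_slaves_file(launched_nodes), end="")
```

Generating the file this way removes the error-prone step of copying addresses between the Cloud interface and the Hadoop configuration by hand.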
Addressing cluster management challenges

As we have discussed, the key challenge in this area is the lack of appropriate functionality to manage the cluster. It is possible to manage Hadoop clusters effectively by adopting tools with dedicated and sophisticated support for various cluster management capabilities. This may include functionalities ranging from node management to service, user, configuration, parameter and job management. Additionally, such tools may provide templates for the common workflows and inputs required for managing all the mentioned entities.

The solutions may also support a friendly and configurable way of providing performance monitoring updates, progress or status notifications, and alerts. This is a user-friendly approach that actively pushes updates on progress, notifications for various events, and changes in status to users, instead of users seeking or polling for them periodically in a passive manner. Furthermore, the mode of receiving these updates can be customized and configured by the user/administrator based on individual preferences and how critical the information is. Thus users, depending on their needs, can configure the communication channel/medium, which can be any one of online updates, e-mails, or SMS notifications.

All these functionalities are supported through an alternate, user-friendly interface which also automates cluster management activities and offers a way to work on multiple activities in parallel. If the user interface is web-based, it additionally offers the ability to access cluster-related functionality from anywhere, at any time.

Addressing cluster sharing challenges

It needs to be mentioned here that cluster sharing essentially means sharing clusters among the different development and testing team members. The main problem here is the manner in which these clusters are typically shared.
In the traditional approach, there are two possible ways to share the cluster. The first is sharing the credentials of a common user account with the entire set of users. The second is creating a separate user account or user group for each user or group with whom you are planning to share the cluster.

If you share the cluster using the first approach, i.e. sharing common user account credentials with all users regardless of their actual usage or access requirements, you are compromising the security of the system. The system (as well as the cluster) and other linked systems are now exposed, because this user account may have exclusive privileges that are now available to all cluster users, regardless of their actual requirements.

In the second approach, one needs to create separate OS-level user accounts on the system (in some cases, even on each node of the cluster) with restricted privileges. This is a complex as well as time-consuming task: you not only have to create and set up the accounts, but also need to maintain and update them as requirements change over time.

Impetus strongly suggests using role-based cluster sharing through the alternate UI. This offers a cleaner way to share clusters without compromising on security. Some solutions not only allow you to control role-based access to various cluster management functionalities, they even offer a way to authenticate users and their roles against a valid existing external user authentication system. One benefit of this method is that users need not be created per machine or at the OS level; they can be created at the solution level, or even reused from existing domain-level users. Thus it is relatively easy to manage and control users through the admin interface of the solution. Furthermore, based on requirements, specific roles can be created on-the-fly and assigned to specific user accounts in order to restrict or provide access to specified functionalities for individual users.

Cluster sharing has definite associated benefits, and if multiple shared clusters can be managed from a single centralized location, without switching interfaces or logging on to multiple machines, the entire task becomes even easier. It is easy to manage and fine-tune a single shared cluster, and users and back-ups too are easy to manage. While working with a shared cluster, all cluster users get performance benefits.
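The role-based sharing idea described above boils down to checking a user's role before allowing a cluster operation. A minimal sketch follows (the role names, operation names, and the `can_perform` helper are hypothetical, not taken from any specific tool):

```python
# Hypothetical role-to-permission mapping for a shared cluster.
ROLE_PERMISSIONS = {
    "admin": {"setup_cluster", "add_node", "remove_node",
              "submit_job", "view_status"},
    "developer": {"submit_job", "view_status"},
    "viewer": {"view_status"},
}

def can_perform(role, operation):
    """Return True if the given role is allowed to run the operation.

    Only admin-role users may perform state-changing operations such
    as cluster setup or node addition/removal; unknown roles get
    no access at all.
    """
    return operation in ROLE_PERMISSIONS.get(role, set())

print(can_perform("developer", "submit_job"))  # developers can run jobs
print(can_perform("developer", "add_node"))    # but cannot change the cluster
```

Because the roles live at the solution level rather than the OS level, granting or revoking access is a dictionary update rather than account administration on every node.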
When compared with non-shared clusters running on individual machines, a shared cluster saves a lot of the time required to set up, manage and troubleshoot local clusters on individual machines. The performance figures obtained from local clusters running on individual user machines are also not a true measure of cluster performance, as the hardware of individual machines is not always the best configuration.

Addressing Hadoop compatibility related challenges

Let us look at the various challenges related to Hadoop itself. The very first challenge is Hadoop compatibility. Hadoop as a technology is still evolving and has not reached complete maturity yet. This factor in turn gives way to numerous challenges, such as API differences across versions, on-wire compatibility (i.e. multiple Hadoop versions on different nodes of the same cluster), and the interoperability of multiple versions and their respective components.
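One basic check a management tool can apply against such version skew is verifying that every node reports the same Hadoop version before treating the cluster as healthy. An illustrative sketch (the node names, version strings, and the `find_version_skew` helper are hypothetical):

```python
from collections import Counter

def find_version_skew(node_versions):
    """Return the nodes whose Hadoop version differs from the majority.

    A management tool could flag these nodes and offer to replace
    their bundles with the majority version, restoring uniformity.
    """
    majority, _ = Counter(node_versions.values()).most_common(1)[0]
    return sorted(n for n, v in node_versions.items() if v != majority)

# Hypothetical version report collected from a 4-node cluster.
reported = {
    "node-1": "0.20.205",
    "node-2": "0.20.205",
    "node-3": "1.0.1",  # mismatched bundle: a wire-compatibility risk
    "node-4": "0.20.205",
}

print(find_version_skew(reported))
```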
Sometimes, given configurations may not be supported in certain versions. Problems may also arise due to multiple vendors and vendor-specific features. There can also be complexities related to bundle formats (tarball and RPM), as their setup and folder locations (bin/conf) differ. Cluster modes are another factor that demands suitable changes in configuration and command execution. Security is available as a separate patch and needs customized configuration. Issues can also crop up due to vendor-specific solutions such as SPOF/HA and compatible file systems.

It is possible to find work-arounds that partially address the compatibility challenges. A complete solution may not be possible, as these problems are the result of the underlying technology. One can address compatibility problems by adopting Hadoop Cluster Management tools, which can give you the option of replacing incompatible bundles with suitable ones. This primarily ensures that all the nodes within the cluster have the same version of Hadoop and the respective components installed.

Other factors such as version, vendor, bundle format, etc. can also be handled to a great extent by using any Hadoop cluster management tool that facilitates context-aware component set up and management support. The tool can take care of differences in file and folder names/locations, command changes and configuration differences in accordance with the bundle format, cluster mode and vendor.

Can Hadoop Cluster Management tools help?

Impetus strongly believes that by supplying these missing dimensions, a tool can offer a better Hadoop working environment and can therefore improve your productivity when working with Hadoop clusters. Such tools can help immensely, as they offer a quick turnaround and enable companies to create new clusters quickly and in accordance with their specifications. The tools offer automated set up, thereby leaving little room for error. They also help to minimize the total cost of operation by reducing the time and effort required for cluster set up and management.

Hadoop cluster management tools can provide integrated support for all organizational requirements from one place and one interface. The tools can also help to set up clusters for different needs, e.g. setting up a cluster for testing an application across different vendors, distributions and versions, and then benchmarking it on different configurations, loads and environments.
They also help in analyzing the impact of cluster size against different load patterns, and enable launching and resizing the cluster on-the-fly.

Among the tools currently available for the effective set up and management of Hadoop clusters are Amazon's Elastic MapReduce, Whirr, Cloudera SCM, and Impetus' Ankush.

The Impetus solution

Impetus' Ankush, a web-based solution, supports the intricacies involved in cluster set up, management, cloud provisioning, and sharing. Ankush offers customers the following benefits:

• Centralized management of multiple clusters. This is a very helpful feature, as it eliminates the need to change interfaces or log in to different machines to manage different clusters.
• In the node management area, for instance, the solution, through its web interface, supports listing existing nodes as well as adding and removing nodes by IP or hostname.
• Support for cluster set up in all possible modes. It also performs context-aware auto-initialization of configuration parameters based on the cluster mode. The initialization support includes services, configuration initialization, initial node addition and initial job submission. Additionally, it supports multiple vendors, versions, and bundles for Hadoop ecosystem components.
• For the cloud, the solution supports the launch and termination of entire Hadoop clusters, for both heterogeneous and homogeneous configurations. Ankush can drive all the required Cloud operations from its own UI, so organizations need not access the cloud-specific interface.
• Centralized management and monitoring of multiple clusters. Individual cluster-based operations can also be managed using the same interface from the same location.
• Support for reconfiguration and upgrade of Hadoop and other components, and for the management of keys, users, configuration parameters, services and monitoring.
• User management with multiple user roles, allowing role-based access to various cluster functionalities. Only a user with the admin role can perform operations that affect the state of the cluster.

The setup of the cluster, its components, and pre-dependencies like Java and password-less SSH is undertaken in an automated fashion.

Figure: Ankush - Impetus' HCM Tool

You can use Ankush to set up and manage local as well as Cloud-based clusters. A web application bundled as a war file, the solution is deployable even at the user level. Ankush furthermore offers anytime-anywhere access to cluster functionalities through its web-based interface.

Ankush is Cloud independent, giving you the option to launch the cluster on other compatible Clouds. Ankush offers a way to quickly apply configuration changes across all the clusters to leverage performance benefits. According to Impetus, it helped the company reduce cluster set up time by 60 percent. Finally, the solution enables bundle set up optimization across cluster nodes using parallelism and bundle re-use.
Summary

In conclusion, for effective Hadoop cluster management, automation facilitates quick and error-free execution of activities. It can be applied to make execution non-interactive and free from human intervention, and to save the extensive time, effort and costs involved in cluster set up. For quickly setting up a cluster on the Cloud, all you need to do is add automation, either through simple scripts or by using Cloud APIs (provider-specific exposed APIs or generic APIs).

Another important takeaway is that adopting a user-friendly GUI as an alternate interface can help address your cluster-sharing problems. It will also support automation and help execute activities in parallel.

It must be reiterated here that Hadoop is still evolving and is yet to reach maturity; it is only possible to address Hadoop compatibility issues in a partial way. Lastly, using a suitable Hadoop Cluster Management tool can enable organizations to deal with the pain areas associated with cluster set up, management, and sharing.

About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media, among others.

Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: New Delhi, Bangalore, Indore, Hyderabad
Visit: www.impetus.com

Disclaimer

The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.