Hadoop in the Enterprise

  • 1. Hadoop in the Enterprise
A Dell Technical White Paper
By Joey Jablonski
Dell | Hadoop White Paper Series
  • 2. Table of Contents
Introduction
Managing Hadoop as an island versus part of corporate IT
Top challenges and methods to overcome
    Automated deployment
    Configuration management
    Monitoring and alerting
Hardware sizing
    Hadoop node sizing
    Hadoop cluster sizing parameters
Hadoop configuration parameters
Hadoop network design
    Dell recommended network architecture
Hadoop security
    Authentication
    Authorization
    Logging
Why Dell
About the author
Special thanks
About Dell Next Generation Computing Solutions
References
To learn more

This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.

© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the Dell logo, the Dell badge, and PowerEdge are trademarks of Dell Inc.
  • 3. Introduction
This is the second in a series of white papers from Dell about Hadoop. If you are new to Hadoop or Dell’s Hadoop solutions, Dell recommends reading the “Introduction to Hadoop” white paper before this paper.

Hadoop is becoming a critical part of many modern information technology (IT) departments. It is being used for a growing range of requirements, including analytics, data storage, data processing, and shared compute resources. As Hadoop’s significance grows, it is important that it be treated as a component of your larger IT organization, and managed as one. Hadoop is no longer relegated to research projects, and should be managed as your company would manage any other large component of your IT infrastructure.

Analytics is a growing need in many organizations. As the volumes of data from multiple sources continue to grow, Hadoop is in the lead as an enterprise tool that can ingest that data and provide the means to analyze it.

Data warehouses are a key component of many IT departments. They provide the business intelligence that many companies use for decision making. Hadoop is beginning to augment these as a central location that feeds many different business intelligence platforms within an organization. This means that Hadoop must be available at the same level as, or a greater level than, the tools it feeds and supports.

Hadoop has taken a standard path into most IT organizations, first being used for testing and development, then migrating into production operations. Because of this, your IT department must ensure that the Hadoop environment is properly planned and designed at initial deployment to support the more rigorous demands of a production environment.

This paper discusses the considerations for your IT department to ensure your Hadoop environment is able to grow and change with your business needs, without requiring major refactoring work after the initial deployment due to suboptimal solution architecture.

Managing Hadoop as an island versus part of corporate IT
Hadoop environments can contain many hundreds or possibly thousands of servers. This large number of devices can become a management burden for your IT department. As your environment grows, your IT administrators can be left struggling with complexity.

Close attention should be paid during your initial Hadoop deployment to ensure it does not become an island that IT must manage outside of standard tools and processes. A Hadoop environment should, optimally, use the company’s existing shared resources for authentication, monitoring, backup, alerting, and processes. This integration, early and often, will ensure that the Hadoop environment does not consume an unnecessary amount of time from the IT department relative to other applications within the corporation.

The design of a Hadoop solution should be optimized for the performance needs and usage model of the intended Hadoop environment. That is not to say it should completely contradict other solutions in the environment. The Hadoop environment should share IT best practices and processes with other solutions in the enterprise to ensure consistency in deployment and operations. This consistency can come from either common hardware or common software across IT environments. Common hardware will ensure ease of servicing, while common software will ensure that tools, scripts, and processes do not require major modification to support the Hadoop environment.

Hadoop deployment can be a time-consuming process. The number of systems involved in a Hadoop environment can easily overwhelm the most experienced of system administrators. It is important that an automated solution be used for deploying the Hadoop environment, both to save time and to ensure consistency. If your enterprise has an existing operating system (OS) deployment strategy, it should be evaluated to determine whether it will be viable for Hadoop deployment; if not, a vendor deployment strategy for the Hadoop environment should be considered.

Operating most IT environments can be more costly than the initial purchase and deployment. Processes for IT operations, support, and escalations should be updated to accommodate the Hadoop environment; you do not want to start over and create a parallel set of processes and structure. Like any IT environment, Hadoop will require regular monitoring and user support. Documentation should be updated to accommodate the differences in the Hadoop environment, and staff should be trained to ensure long-term support of the Hadoop environment.
  • 4. Top challenges and methods to overcome

Automated deployment
Automated deployment of both operating systems and the Hadoop software ensures consistent, streamlined deployments. Dell provides proven configurations that are documented and simpler to deploy than traditional manual IT deployment strategies. Dell augments these proven solutions with services and tools to streamline solution deployment, testing, and field validation.

Configuration management
Dell recommends the use of a configuration management tool for all Hadoop environments. Dell | Hadoop solutions include Chef for this purpose. Chef is used for deploying configuration changes, managing the installation of the Hadoop software, and providing a single interface to the Hadoop environment for updates and configuration changes.

Monitoring and alerting
Hardware monitoring and alerting is an important part of all dynamic IT environments. Successful monitoring and alerting ensures that problems are caught as soon as possible and administrators are alerted so the problems can be corrected before users are impacted. Dell provides support for integration with Nagios and Ganglia as part of our Hadoop solution stack for monitoring the software and hardware environment.

Dell also recommends integrating the Hadoop environment into any existing enterprise monitoring and management packages. Dell | Hadoop solutions support standard interfaces, including Simple Network Management Protocol (SNMP) and Intelligent Platform Management Interface (IPMI), for integration with third-party management and operations tools.

Hardware sizing
Sizing Hadoop environments is often a trial-and-error process. With Dell | Hadoop solutions, Dell provides tested, known configurations to streamline the sizing process and deliver sizing guidance based on known, real-world workloads.

There are two aspects to sizing within a Hadoop environment:
1. Individual node size: the hardware configuration of each individual node.
2. Hadoop cluster: the size of the entire Hadoop environment, including the number of nodes and the interconnects between them.

Hadoop node sizing
The four primary sizing considerations for Hadoop nodes are physical memory, disk capacity, network bandwidth, and CPU speed and core count. A properly balanced Hadoop node has enough of each resource that a shortage in one does not create a bottleneck that degrades performance.

Simpler is better when sizing any Hadoop environment. Feedback from seasoned Hadoop users has consistently been the same recommendation: targeting a middle-ground configuration, with a good balance of the key components of the individual servers, is better than trying to optimize the hardware design for the expected workload. First, workloads change much more often than the hardware is replaced. Second, the mix of user workloads will also change over time, and a balanced configuration, while not optimized for a specific use, will not penalize any one job type over another.

Hadoop cluster sizing parameters
Sizing a Hadoop cluster is a different activity from sizing the individual nodes, and should be planned for accordingly. There are separate parameters that must be considered when sizing the entire cluster; these include the number of jobs to be run, data volume, expected growth, and number of users. Consideration should also be given to the amount of data to be replicated, the number of data replicas, and the availability needs of the Hadoop cluster.

Hadoop contains more than 200 tunable parameters, many of which will not influence your specific job. Dell | Hadoop solutions provide recommended configuration parameters for each of the three use cases of Hadoop: Compute, Storage, and Database. Dell has validated these parameters for documented workloads, streamlining system bring-up and tuning for new Hadoop environments.
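
As a rough illustration of how the cluster sizing parameters above interact, the following Python sketch estimates a DataNode count from data volume, expected growth, and the number of data replicas. The formula, the function name `datanodes_needed`, the headroom fraction, and every number in the example are assumptions made for illustration; they are not taken from Dell's sizing guidance. Real sizing should also weigh the number of jobs, the number of users, and availability needs.

```python
import math


def datanodes_needed(raw_data_tb, replicas, annual_growth, years,
                     usable_tb_per_node, headroom=0.25):
    """Estimate how many DataNodes are needed to hold a data set.

    All parameters are hypothetical planning inputs:
      raw_data_tb        -- data volume today, before replication (TB)
      replicas           -- number of copies kept of each block
      annual_growth      -- expected yearly growth as a fraction (0.5 = 50%)
      years              -- planning horizon in years
      usable_tb_per_node -- disk capacity available for data on one node (TB)
      headroom           -- fraction of capacity kept free for temporary data
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    # Data volume at the end of the planning horizon, compounding yearly.
    projected_tb = raw_data_tb * (1 + annual_growth) ** years
    # Every replica consumes disk space somewhere in the cluster.
    stored_tb = projected_tb * replicas
    # Inflate the requirement so the chosen headroom stays free.
    required_tb = stored_tb / (1 - headroom)
    return math.ceil(required_tb / usable_tb_per_node)


if __name__ == "__main__":
    # Assumed example: 100 TB today, 3 replicas, 50% yearly growth,
    # a 2-year horizon, and 12 TB of usable disk per node.
    print(datanodes_needed(100, 3, 0.5, 2, 12))  # prints 75
```

Because replicas multiply the stored volume directly, the replica count usually dominates this estimate; with these assumed inputs, moving from two replicas to three raises the result from 50 to 75 nodes.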
  • 5. Hadoop configuration parameters
Hadoop tuning and optimization is a never-ending process. This recurring process is depicted in Figure 1. The beginning of each cycle is “Determine parameter to change.” The process then proceeds clockwise and repeats. All Hadoop environments will need to be reviewed for tuning and performance optimization as often as the mix of jobs and workload characteristics change.

Figure 1. The ongoing Hadoop tuning and optimization process.

Hadoop network design
The network design is a key component of a Hadoop environment, and must reflect both the expected usage of the environment and its scalability. The network design should factor in the expected workloads, provide for adequate growth of the environment without a major rework, and support monitoring of the network to alert administration staff to network congestion.

Isolation is a primary design requirement for Hadoop clusters. The network that the Hadoop nodes use for node-to-node communication should be isolated from the other corporate networks. This ensures maximum bandwidth for the Hadoop environment and minimizes the impact of other operations on Hadoop.

One important consideration of the network design for Hadoop solutions is the ability of the switches to accommodate high-packet-count communication patterns. Hadoop can create large volumes of packets on the network during rebuild operations as well as during high Hadoop Distributed File System (HDFS) I/O activity. The network architecture should ensure the switches can handle this traffic pattern without additional network latency or dropped packets. The Dell white paper “Introduction to Hadoop” covers our network architecture in greater detail.

Like many modern applications, Hadoop uses IP for all communications. This means Hadoop has the same restrictions and design requirements you would see for any other network regarding broadcast domains, network separation, and routing and switching considerations. Dell recommends that Hadoop clusters place approximately 60 nodes within a single switched environment. For clusters larger than that, Dell recommends using a layer 3 network device to segment the network and maintain adequately sized broadcast domains.

Today, most Hadoop solutions are built using Gigabit Ethernet. Many DataNodes will use two Gigabit Ethernet connections in an aggregated link to the switch. This provides additional bandwidth for the node. Some users are beginning to look at 10 Gigabit Ethernet as the price continues to come down. The majority of users today do not require the additional performance of 10 Gigabit Ethernet, and can save some cost in their Hadoop environments by using Gigabit Ethernet.

  • 6. Availability of the network infrastructure is a primary concern if the Hadoop environment is critical to business operations or functions. The network should include the appropriate level of redundancy to ensure business functions will continue if components within the network fail. Hadoop has facilities within the software for handling failures of the hardware and network; these should be accounted for when determining which tiers of the network will contain redundant components and which will not. You can gain efficiency by leveraging Hadoop’s ability to replicate information to separate racks or to servers on alternate switches, allowing access in the event of a network failure.

Dell recommended network architecture
Figure 2 shows the recommended network architecture for the top-of-rack (ToR) switches within a Hadoop environment. These connect directly to the DataNodes and allow for all inter-node communication within the Hadoop environment. The standard configuration Dell recommends is six 48-port Gigabit Ethernet switches (for example, Dell™ PowerConnect™ 6248) that have stacking capabilities. These six switches will be stacked and act as a single switch for management purposes. This network configuration will support up to 60 DataNodes for Hadoop; additional nodes can easily be added by using an end-of-row (EoR) switch as described in the next section.

Figure 2. Recommended network architecture for the top-of-rack switches within a Hadoop environment.

Hadoop is a highly scalable software platform that requires a network designed with the same scalability in mind. To ensure maximum scalability, Dell recommends a network architecture that allows you to start with small Hadoop configurations and grow those over time by adding components, without requiring a rework of the existing environment. To that end, Dell recommends the use of two switches acting as EoR devices. These will connect to the ToR switches, as shown in Figure 3, but add routing and advanced functionality to scale above 60 DataNodes. These two EoR switches will allow a maximum of 720 DataNodes within the Hadoop environment before an additional layer of network connectivity is needed or larger switches are required for EoR connectivity.

Figure 3. Two switches acting as EoR devices and connecting to ToR switches.
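
The node counts quoted above (up to 60 DataNodes per stacked ToR group, and a maximum of 720 DataNodes behind two EoR switches) imply at most 12 ToR groups per EoR pair. A minimal Python sketch of that arithmetic follows. The function name `plan_network` and the tier labels are hypothetical, and the two constants are simply the figures stated in this paper rather than hard limits of any particular switch.

```python
import math

# Figures taken from the text above; treat them as planning guidance.
NODES_PER_TOR_STACK = 60       # DataNodes supported by one stacked ToR group
MAX_NODES_WITH_TWO_EOR = 720   # DataNodes supported behind two EoR switches


def plan_network(datanodes):
    """Return (ToR stacks needed, network tier) for a DataNode count.

    The tier labels are hypothetical names used only in this sketch.
    """
    if datanodes < 1:
        raise ValueError("datanodes must be at least 1")
    # Each stacked ToR group serves up to 60 DataNodes.
    tor_stacks = math.ceil(datanodes / NODES_PER_TOR_STACK)
    if datanodes <= NODES_PER_TOR_STACK:
        tier = "ToR only"
    elif datanodes <= MAX_NODES_WITH_TWO_EOR:
        tier = "ToR + two EoR switches"
    else:
        tier = "additional network layer or larger EoR switches"
    return tor_stacks, tier


if __name__ == "__main__":
    for count in (40, 61, 720, 721):
        print(count, plan_network(count))
```

Keeping the limits as named constants makes it easy to rerun the check if a different switch model changes the number of nodes each tier can serve.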
  • 7. Hadoop security

Authentication
Hadoop supports a variety of authentication mechanisms, including Kerberos, Active Directory, and Lightweight Directory Access Protocol (LDAP). Each of these allows a list of authorized users and their credentials to be stored centrally and validated from the Hadoop environment. Dell recommends using an existing companywide authentication scheme for Hadoop, eliminating the need to support a separate authentication system that serves only Hadoop.

Authorization
Authorization is an additional layer on top of the authentication that must occur for users. Authentication verifies that the user credentials are correct; in essence, that the user name and password are correct, active, and valid for some period of time. Authorization builds on that validation to determine whether the user is allowed to perform the requested action. Authorization is commonly implemented as file permissions in a file system and as access controls within a relational database management system (RDBMS).

By using centralized authentication and authorization for Hadoop and other corporate services, security models can be developed between the environments to ensure permissions are properly mapped across environments. Hadoop is commonly used as an intermediary point for data processing and storage; it should enforce all corporate security policies for data that passes through Hadoop and is processed by Hadoop.

Logging
Logging is a critical part of both system operation and security. The logs for Hadoop and the underlying systems provide insight into unexpected access that could lead to a system or data compromise, as well as into system stability issues. Hadoop environments should use a central logging facility for correlation of logs, recovery of logs from failed hosts, and alerting on unexpected events and anomalies.

In development environments, and as your environment grows, it is beneficial to compare the Hadoop logs with the logs from the applications that use Hadoop. This capability enables your administrators to correlate problems across the entire stack of components that make up a functioning application.

Why Dell
Dell has worked with Cloudera to design, test, and support an integrated solution of hardware and software for implementing the Hadoop ecosystem. This solution has been engineered and validated to work together and provide known performance parameters and deployment methods. Dell recommends that you use known hardware and software solutions when deploying Hadoop to ensure low-risk deployments and minimal compatibility issues. Dell’s solution ensures that you get maximum performance with minimal testing prior to purchase.

Dell recommends that you purchase and maintain support on the entire ecosystem of your Hadoop solution. Today’s solutions are complex combinations of components that require upgrades as new software becomes available and assistance when staff is working on new parts of the solutions. The Dell | Cloudera Hadoop solution provides a full line of support, including hardware and software, so you always have a primary contact for assistance to ensure maximum availability and stability of your Hadoop environment.

About the author
Joey Jablonski is a principal solution architect with Dell’s Data Center Solutions team. Joey works to define and implement Dell’s solutions for big data, including solutions based on Apache Hadoop. Joey has spent more than 10 years working in high performance computing, with an emphasis on interconnects, including InfiniBand, and parallel file systems. Joey has led technical solution design and implementation at Sun Microsystems and Hewlett-Packard, as well as consulted for customers, including Sandia National Laboratories, BP, ExxonMobil, E*Trade, Juelich Supercomputing Centre, and Clumeq.
  • 8. Special thanks
The author extends special thanks to:
    • Aurelian Dumitru, Senior Cloud Solutions Architect, Dell
    • Rebecca Brenton, Cloud Software Alliances, Dell
    • Scott Jensen, Director, Cloud Solutions Software Engineering, Dell

About Dell Next Generation Computing Solutions
When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges, from compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune your company’s “factory” for maximum performance and efficiency.

Dell Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while allowing scalability as your company grows.

Deployment and support are tailored to your unique operational requirements. Dell Cloud Computing Solutions can help you minimize the tangible operating costs that have hyperscale impact on your business results.

References
Chef: http://wiki.opscode.com/display/chef/Home
SNMP: http://www.net-snmp.org/
IPMI: http://www.intel.com/design/servers/ipmi/ipmi.htm
Cloudera Hadoop: http://www.cloudera.com/products-services/enterprise/

To learn more
To learn more about Dell cloud solutions, contact your Dell representative or visit: www.dell.com/cloud

© 2011 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerConnect are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others. This document is for informational purposes only. Dell reserves the right to make changes without further notice to the products herein. The content provided is as-is and without express or implied warranties of any kind.