The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – Impetus White Paper


For Impetus’ White Papers archive, visit http://lf1.me/drb/

This white paper talks about the design considerations for enterprises to run Hadoop as a shared service for multiple departments.

As Hadoop becomes more mainstream and indispensable to enterprises, it is imperative that they build, operate and scale shared Hadoop clusters. The design considerations discussed in this paper will help enterprises accomplish the essential mission of running multi-tenant, multi-use Hadoop clusters at scale.

The white paper talks about Identity, Security, Resource Sharing, Monitoring and Operations on the Central Service.


Published in: Technology, Business

Transcript

  • 1. The Shared Elephant: A Shared Central Big Data Repository. This white paper talks about the design considerations for enterprises to run Hadoop as a shared service for multiple departments. www.impetus.com
  • 2. Introduction

Learn about the considerations for enterprises to use Hadoop as a shared service for multiple applications and business units. Read about Identity, Security, Resource Sharing, Monitoring and Operations on the Central Service.

Running an enterprise Big Data repository requires significant investment in resources. A dedicated cluster for each department is cost-prohibitive and leads to Big Data silos and underutilized cluster resources. Enterprises that run Hadoop at scale should allow Hadoop clusters to be shared by different business units. They must also support multiple use cases as well as a check-in/check-out model for blocks of analytic work. We cover some design considerations for identity management, security, resource sharing and monitoring that are essential to build a secure, robust, highly available and shared central Big Data repository.

Identity

Security is of paramount concern in a shared, multi-tenant environment. Early versions of Hadoop had rudimentary security features, essentially relying on a fair use policy in a trusted environment. Recent versions of Hadoop have added significant identity management features. Let us explore a couple of these in detail.

Kerberos

Kerberos provides the authentication on which Hadoop's authorization checks are built. The Kerberos mechanism provides stronger authentication in a more secure fashion than what was available in earlier versions of Hadoop. All clients have to authenticate with a central Kerberos service, the Key Distribution Center. Once identities are verified, Hadoop can enforce access control and privileges against them. Kerberos also enforces authentication of data node and task tracker daemons with their parent services (name node and job tracker). This authentication prevents rogue data nodes from connecting to the parent services and compromising the data stored in the cluster. The figure below shows how Hadoop Kerberos authentication works.

[Figure: Hadoop Kerberos Authentication. Labels: Kerberos Key Distribution Center (Authentication Service Request, Session Ticket, Session Ticket & Session Key); Hadoop Cluster with Parent Services (Name Node in the HDFS Layer, Job Tracker in the M/R Layer) and worker nodes for Tenant 1, Tenant 2 and Tenant 3, each with a Data Node, a Task Tracker and HDFS Data.]
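As a rough illustration of what turning on Kerberos involves, the sketch below renders the two `core-site.xml` properties that switch Hadoop from its default "simple" authentication to Kerberos and enable service-level authorization. The property names are standard Hadoop settings; the helper `render_hadoop_config` is a hypothetical name introduced here, and a real deployment also needs principals and keytabs for every daemon, which this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Standard core-site.xml properties; the values shown enable Kerberos
# authentication and service-level authorization checks.
KERBEROS_PROPERTIES = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",
}


def render_hadoop_config(properties):
    """Render a dict of properties as Hadoop-style <configuration> XML.

    Hypothetical helper for illustration; not part of any Hadoop tooling.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_hadoop_config(KERBEROS_PROPERTIES))
```

Generating the file from one dictionary keeps the setting identical on every node, which matters because a single node left on "simple" authentication weakens the whole cluster.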
  • 3. Lightweight Directory Access Protocol (LDAP) Integration

LDAP can be used to create consistent user accounts on all of the data nodes. This enables fine-grained access control policies and helps prevent privilege escalation attacks.

Security

Hadoop has several security features, as listed below:

    • Running data node daemons on privileged ports.
    • Running tasks as the job owner instead of the task tracker daemon user. This prevents other users from changing the job or viewing the local task data.
    • Preventing users other than the job owner from viewing map outputs.
    • Restricting a task to communicate only with its parent task tracker, which prevents rogue users from inspecting map input data.

Data Security

Hadoop does not natively integrate with data-at-rest encryption solutions. However, the Intel distribution of Hadoop provides fast encryption using Intel hardware enhancements. Hadoop 2.0 provides SSL transport between Hadoop daemons and during the shuffle phase.

Sharing Resources

Allocating shared resources to different users and groups in a fair and efficient manner poses some unique challenges in Hadoop. Hadoop does not provide the policies and SLAs that are typical of shared systems. Hadoop presents the storage layer (HDFS) as a single shared resource, but the computational layer (MapReduce) requires some fine-tuning for optimal results. Nevertheless, here are some recommendations on running a user-friendly shared Hadoop cluster.

Resource Usage Limits

    • HDFS Quotas: HDFS provides name quotas (a limit on the number of files and directories) and space quotas (a limit on the bytes consumed). Both are very useful for enforcing sensible limits on HDFS usage. Designing a sensible shared directory structure is important, since quotas are set at the directory level. It is a good practice to have a common directory that is shared across groups and separate quota-limited directories for each group in a shared cluster.
    • Task Slots: Task slots are configured on a per-node basis, taking into account the total capacity of the cluster. Individual jobs are then monitored to determine the number of mappers; the recommended practice is to use a multiple of the number of map slots.
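The quota and task slot recommendations above can be sketched in a few lines. The `-setQuota` and `-setSpaceQuota` options of `hdfs dfsadmin` are real commands (older releases invoke them as `hadoop dfsadmin`); the directory names, limits, node counts and helper functions are assumptions made up for this illustration, and the commands are only printed, not executed.

```python
import math


def quota_commands(directory, max_names, max_bytes):
    """Build the dfsadmin invocations that set a name and a space quota."""
    return [
        ["hdfs", "dfsadmin", "-setQuota", str(max_names), directory],
        ["hdfs", "dfsadmin", "-setSpaceQuota", str(max_bytes), directory],
    ]


def total_map_slots(nodes, map_slots_per_node):
    """Cluster-wide map slots when every node has the same slot count."""
    return nodes * map_slots_per_node


def round_up_to_slots(estimated_maps, slots):
    """Round a job's map count up to the nearest multiple of the map slots."""
    return max(1, math.ceil(estimated_maps / slots)) * slots


if __name__ == "__main__":
    # Hypothetical per-group directories, each with its own limits.
    groups = {
        "/groups/finance": (100_000, 10 * 1024**4),   # 10 TiB
        "/groups/marketing": (50_000, 5 * 1024**4),   # 5 TiB
    }
    for directory, (names, space) in groups.items():
        for command in quota_commands(directory, names, space):
            print(" ".join(command))

    slots = total_map_slots(nodes=20, map_slots_per_node=8)
    print(slots, round_up_to_slots(500, slots))
```

Note that an HDFS space quota counts every replica of a block, so a directory with a replication factor of 3 consumes its quota three times as fast as the raw file sizes suggest.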
  • 4. Scheduling

Hadoop provides different schedulers as plug-ins. That said, not all schedulers are created equal. The FIFO scheduler should not be used, as it can lead to significant resource underutilization and job starvation. The fair scheduler is a good option for a dedicated cluster but may lead to resource contention in a shared environment. The capacity scheduler is the optimal choice for a shared cluster. The capacity scheduler provides multi-tenancy controls that prevent a user or a group of users from overwhelming the cluster. It also provides capacity guarantees through soft limits and enforceable hard limits. The capacity scheduler additionally improves security by providing ACLs for job queues.

Monitoring

Hadoop provides good monitoring options. We recommend using Ganglia or similar monitoring for production clusters. JMX monitoring should also be enabled. Recent versions of Hadoop ship with the more flexible metrics2 framework for metrics collection. Using metrics2 in the Ganglia context provides valuable insight into cluster usage. Oozie workflows also enable SLA tracking, which is important for a shared cluster.

Operations

We have discussed several operational considerations such as security, optimal resource sharing and monitoring. In addition to these, the operations team needs to build a proactive ‘service’ approach that addresses the full range of service components present in a Hadoop environment. Each of these components is a potential point of failure. Operations needs to shift from passive monitoring to actively meeting SLAs in a new distributed environment. This shift in focus necessitates a new organizational culture in addition to operational excellence.

Operational Excellence

Operational excellence for a shared cluster is not just about cluster health and uptime. Service metrics such as job completion rate, resource sharing and meeting SLAs are also significant.
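To make the idea of service metrics concrete, here is a minimal sketch that computes a job completion rate and an SLA attainment rate from a list of finished jobs. The `JobRecord` fields, the tenant names and the sample values are all hypothetical; in practice the records would come from job history or workflow logs, which this sketch does not read.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical summary of one finished job."""
    tenant: str
    succeeded: bool
    runtime_minutes: float
    sla_minutes: float


def completion_rate(jobs):
    """Fraction of jobs that finished successfully."""
    if not jobs:
        return 0.0
    return sum(j.succeeded for j in jobs) / len(jobs)


def sla_attainment(jobs):
    """Fraction of jobs that succeeded within their SLA window."""
    if not jobs:
        return 0.0
    met = sum(j.succeeded and j.runtime_minutes <= j.sla_minutes for j in jobs)
    return met / len(jobs)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    jobs = [
        JobRecord("finance", True, 42.0, 60.0),
        JobRecord("finance", True, 75.0, 60.0),    # succeeded, but late
        JobRecord("marketing", False, 10.0, 30.0),  # failed
        JobRecord("marketing", True, 25.0, 30.0),
    ]
    print(completion_rate(jobs), sla_attainment(jobs))
```

Tracking the two numbers separately is deliberate: a cluster can look healthy on completion rate while tenants are still missing their deadlines.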
It is important to operationalize the aspects of identity, security, resource sharing and monitoring discussed above. To accomplish this, Hadoop operations teams need to perform regular audits and fire drills and maintain well-documented processes and procedures. A runbook-based troubleshooting guide and well-formulated support levels (Level 1, Level 2, and Level 3) with an easy escalation procedure are also required. If SLAs mandate limited service interruption, then the runbooks should specify maximum resolution times and mandatory escalation based on severity and time-sensitive resolution. Operational excellence is a function of all of the above.
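Returning to the capacity scheduler recommended in the Scheduling section, the sketch below shows how per-tenant queues with guaranteed capacities, hard limits and submit ACLs might be expressed. It assumes a Hadoop 2 (YARN) cluster and uses the `yarn.scheduler.capacity.*` property names from `capacity-scheduler.xml`; Hadoop 1 clusters use different property names. The queue names, percentages, group names and the helper function are hypothetical.

```python
QUEUE_PREFIX = "yarn.scheduler.capacity.root"


def capacity_scheduler_properties(queues):
    """Map {queue name: settings} to capacity-scheduler.xml properties.

    Hypothetical helper; raises ValueError unless the guaranteed
    capacities of the top-level queues add up to 100 percent.
    """
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    props = {f"{QUEUE_PREFIX}.queues": ",".join(queues)}
    for name, q in queues.items():
        base = f"{QUEUE_PREFIX}.{name}"
        # Guaranteed share of the cluster (the soft limit).
        props[f"{base}.capacity"] = str(q["capacity"])
        # Ceiling the queue may grow to when others are idle (the hard limit).
        props[f"{base}.maximum-capacity"] = str(q["maximum_capacity"])
        # ACL format is "users groups"; a leading space means groups only.
        props[f"{base}.acl_submit_applications"] = q["submit_acl"]
    return props


if __name__ == "__main__":
    tenants = {
        "finance": {"capacity": 50, "maximum_capacity": 70, "submit_acl": " finance"},
        "marketing": {"capacity": 30, "maximum_capacity": 50, "submit_acl": " marketing"},
        "adhoc": {"capacity": 20, "maximum_capacity": 30, "submit_acl": "*"},
    }
    for key, value in capacity_scheduler_properties(tenants).items():
        print(f"{key} = {value}")
```

Keeping each queue's maximum capacity below 100 is what stops a single tenant from taking over an otherwise idle cluster and delaying the next tenant's jobs.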
  • 5. Summary

Design considerations for multi-tenant, multi-use Hadoop clusters are:

    • Design for security as part of the initial cluster planning.
    • Implement user-friendly resource sharing while meeting SLAs.
    • Use the Capacity Scheduler.
    • Monitor service metrics in addition to cluster metrics.
    • Institutionalize operational excellence through streamlined procedures and by cultivating a service mindset.

As Hadoop becomes more mainstream and indispensable to enterprises, it is imperative that they build, operate and scale shared Hadoop clusters. The design considerations discussed in this paper will help enterprises accomplish the essential mission of running multi-tenant, multi-use Hadoop clusters at scale.

About Impetus

Impetus Technologies is a leading provider of Big Data solutions for the Fortune 500®. We help customers effectively manage the “3-Vs” of Big Data and create new business insights across their enterprises.

Visit http://bigdata.impetus.com or write to us at bigdata@impetus.com

© 2013 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. Oct 2013 #52991
