With the advent of Hadoop comes a need for professionals skilled in Hadoop administration, making it worthwhile to train as a Hadoop Admin for better career, salary and job opportunities.
2. Slide 2 | www.edureka.co/hadoop-admin | Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for questions
Objectives
At the end of this module, you will be able to understand:
Hadoop Cluster introduction
Recommended Configuration for cluster
Hadoop cluster running modes
Hadoop Security with Kerberos
HDFS Security with ACLs (Access Control Lists )
Hadoop Admin Responsibilities
Demo on Security
4. Slide 4
Hadoop Cluster: A Typical Use Case
Active NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: redundant power supply
StandBy NameNode (optional)
» Same configuration as the Active NameNode: 64 GB RAM, 1 TB hard disk, Xeon with 8 cores, 3 x 10 Gb/s Ethernet, 64-bit CentOS, redundant power supply
Secondary NameNode
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
» Power: redundant power supply
DataNodes (several per cluster)
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 Gb/s
» OS: 64-bit CentOS
5. Slide 5
Slave Nodes: Recommended Configuration
Higher-performance vs. lower-performance components: save the money, buy more nodes!
"A cluster with more nodes performs better than one with fewer, slightly faster nodes."
A 'base' configuration for a slave node:
» 4 x 1 TB or 2 TB hard drives, in a JBOD* configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet
General configuration: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications.
Special configuration: depends on the requirement.
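The general configuration rule above can be sketched as a tiny sizing helper. This is illustrative only: the function name and the 6 GB-per-drive default are my own assumptions, while the ratios come from the slide.

```python
# Illustrative sizing helper for the "multiples of (1 hard drive + 2 cores
# + 6-8 GB RAM)" rule of thumb. Not part of any Hadoop tooling.
def balanced_config(drives, ram_per_drive_gb=6):
    """Scale cores and RAM in proportion to the number of data drives."""
    return {"drives": drives,
            "cores": 2 * drives,
            "ram_gb": ram_per_drive_gb * drives}

# The 'base' slave node above (4 drives) comes out at 8 cores / 24 GB RAM,
# matching the 2 x quad-core CPUs and 24-32 GB RAM recommendation.
print(balanced_config(4))  # {'drives': 4, 'cores': 8, 'ram_gb': 24}
```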
6. Slide 6
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
» No daemons; everything runs in a single JVM
» Has no DFS
» Suitable for running MapReduce programs during development
Pseudo-Distributed Mode
» Hadoop daemons run on the local machine
Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines
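As a minimal sketch of how pseudo-distributed mode is typically enabled (assuming Hadoop 2.x conventions; port 9000 is the common example value, not a requirement): point fs.defaultFS at localhost in core-site.xml, and since there is only one local DataNode, set dfs.replication to 1 in hdfs-site.xml.

```xml
<!-- core-site.xml: run HDFS against daemons on localhost -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single local DataNode, so replication must be 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```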
7. Slide 7
Security issues in Hadoop Cluster
Unauthorized clients can impersonate authorized users and access the cluster
Clients can fetch blocks directly from the DataNodes, bypassing the NameNode
Data packets sent by DataNodes to clients can be eavesdropped
Not all users should have access to sensitive data
No user verification for MapReduce code execution; a malicious user could submit a job
Insecure network transport
No message-level security
8. Slide 8
Hadoop security considerations
Authentication
Authorization
Access control
Data masking and encryption
Network security
Integrity
Confidentiality
Audits and event monitoring
10. Slide 10
Kerberos to the rescue
Network authentication protocol
Developed at MIT in the mid-1980s
Easy for administrators to manage passwords by storing them centrally
Enhance security by ensuring no clear text passwords are transmitted
Allow users to access different services with the same password
Available as open source or in supported commercial software
11. Slide 11
Kerberos Design Requirements
Interactions between hosts and clients should be encrypted.
Must be convenient for users (or they won’t use it).
Protect against intercepted credentials.
Kerberos is based on the secret-key distribution model:
» Keys are the basis of authentication in Kerberos
» Typically a short sequence of bytes
» Used to both encrypt and decrypt
12. Slide 12
Kerberos Components & Terminology
Kerberos Client
Kerberos Server
Kerberos Key Distribution Center (KDC)
Authentication Server (AS)
Ticket-Granting Server (TGS)
Users and services in a Kerberos realm are known as Principals.
13. Slide 13
Kerberos Integration
User authentication: user and group access control lists at the cluster level
Tokens: delegation tokens, job tokens, and block access tokens
RPC security: Simple Authentication and Security Layer (SASL) with the RPC digest mechanism
[Diagram: Client, Kerberos Key Distribution Center (Authentication Server + Ticket Granting Server), Server]
1: Authentication - get TGT
2: Authorization - get Service Ticket
3: Service Request - start service session
14. Slide 14
Kerberos to the rescue
[Diagram: Client, Kerberos Key Distribution Center (Authentication Server + Ticket Granting Server), Server]
1. Client requests a TGT from the Authentication Server (Auth)
2. Authentication Server responds with an encrypted session key + TGT (TGT + Sk1)
3. Client requests a service ticket from the Ticket Granting Server by providing the TGT
4. Ticket Granting Server responds with an encrypted session key and a ticket granting access to the service (TGT + Sk2)
5. Client authenticates to the server with the service ticket (Auth + TGT)
6. Server responds with an encrypted timestamp (Sk2 + Auth)
Auth -> Authenticator
TGT -> Ticket Granting Ticket
Sk1, Sk2 -> Session Keys
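The six-step exchange above can be sketched as a toy simulation. The "encryption" here is mocked by tagging the payload with the key; real Kerberos uses symmetric ciphers such as AES. All class, function and variable names are illustrative assumptions, not real Hadoop or MIT Kerberos APIs.

```python
import os

# Toy simulation of the six-step Kerberos exchange. "Encryption" is mocked:
# decrypt() only succeeds when given the same key that encrypt() was given.
def encrypt(key, payload):
    return {"key": key, "payload": payload}

def decrypt(key, blob):
    if blob["key"] != key:
        raise ValueError("wrong key")
    return blob["payload"]

class KDC:
    def __init__(self):
        self.client_keys = {}         # long-term keys shared with clients
        self.service_keys = {}        # long-term keys of services
        self.tgs_key = os.urandom(8)  # Ticket Granting Server's own key

    def issue_tgt(self, principal):
        """Authentication Server (steps 1-2): return Sk1 and the TGT."""
        sk1 = os.urandom(8)
        tgt = encrypt(self.tgs_key, {"principal": principal, "sk1": sk1})
        return encrypt(self.client_keys[principal], sk1), tgt

    def issue_service_ticket(self, tgt, authenticator, service):
        """Ticket Granting Server (steps 3-4): return Sk2 and the ticket."""
        session = decrypt(self.tgs_key, tgt)
        if decrypt(session["sk1"], authenticator) != session["principal"]:
            raise ValueError("authenticator mismatch")
        sk2 = os.urandom(8)
        ticket = encrypt(self.service_keys[service],
                         {"principal": session["principal"], "sk2": sk2})
        return encrypt(session["sk1"], sk2), ticket

kdc = KDC()
kdc.client_keys["alice"] = os.urandom(8)
kdc.service_keys["hdfs/namenode"] = os.urandom(8)

# Steps 1-2: client gets Sk1 (readable only with its own key) plus the TGT.
enc_sk1, tgt = kdc.issue_tgt("alice")
sk1 = decrypt(kdc.client_keys["alice"], enc_sk1)

# Steps 3-4: client presents TGT + authenticator, gets Sk2 + service ticket.
enc_sk2, ticket = kdc.issue_service_ticket(tgt, encrypt(sk1, "alice"),
                                           "hdfs/namenode")
sk2 = decrypt(sk1, enc_sk2)

# Steps 5-6: the service decrypts the ticket with its own long-term key and
# now shares Sk2 with the client; no password ever crossed the wire.
granted = decrypt(kdc.service_keys["hdfs/namenode"], ticket)
assert granted["principal"] == "alice" and granted["sk2"] == sk2
```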
15. Slide 15
Kerberos advantages
A password never travels over the network. Only time-sensitive tickets travel over the network.
Passwords or secret keys are only known to the KDC and the principal.
Kerberos allows passwords or secret keys to be stored in a centralized, LDAP-compliant credential store. This makes it easy for administrators to manage the system and its users.
Servers don't have to store any tickets or any client-specific details to authenticate a client.
17. Slide 17
HDFS Permissions ( ACLs )
HDFS supports a permission model equivalent to traditional Unix permissions
For each file or directory, permissions are managed for a set of 3 distinct user classes
Owner
Group
Others
There are 3 different permissions controlled for each user class
Read
Write
Execute
For files: the r permission is required to read the file, and the w permission is required to write or append to the file.
For directories: the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.
18. Slide 18
HDFS Permissions ( ACLs )
Each client process that accesses HDFS has a two-part identity composed of the user name and a groups list.
Whenever HDFS must do a permissions check for a file or directory foo accessed by a client process:
1. If the user name matches the owner of foo, then the owner permissions are tested.
2. Else if the group of foo matches any member of the groups list, then the group permissions are tested.
3. Otherwise the other permissions of foo are tested.
4. If a permissions check fails, the client operation fails.
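The four-step check above can be sketched as follows; FileStatus and check_permission are illustrative names, not Hadoop's actual classes.

```python
from collections import namedtuple

# Minimal model of the HDFS permission check described above.
# perms is a Unix-style string, e.g. "rwxr-x---".
FileStatus = namedtuple("FileStatus", "owner group perms")

def check_permission(status, user, groups, action):
    """action is one of 'r', 'w', 'x'; returns True if access is allowed."""
    if user == status.owner:
        triplet = status.perms[0:3]   # 1. owner permissions
    elif status.group in groups:
        triplet = status.perms[3:6]   # 2. group permissions
    else:
        triplet = status.perms[6:9]   # 3. other permissions
    return action in triplet          # 4. operation fails if check fails

foo = FileStatus(owner="hdfs", group="hadoop", perms="rwxr-x---")
assert check_permission(foo, "hdfs", [], "w")            # owner may write
assert check_permission(foo, "bob", ["hadoop"], "r")     # group may read
assert not check_permission(foo, "eve", ["staff"], "r")  # others may not
```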
19. Slide 19
ACLs Shell Commands
hdfs dfs -getfacl [-R] <path>
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
hdfs dfs -setfacl [-R] [-b |-k |-m |-x <acl_spec> <path>] |[--set <acl_spec> <path>]
Sets Access Control Lists (ACLs) of files and directories.
hdfs dfs -ls <args>
The output of ls will append a '+' character to the permissions string of any file or directory that has an ACL.
21. Slide 21
Hadoop Admin Responsibilities
Responsible for the implementation and ongoing administration of the Hadoop infrastructure.
Testing HDFS, Hive, Pig and MapReduce access for applications.
Cluster maintenance tasks such as backup, recovery, upgrades and patching.
Performance tuning and capacity planning for clusters.
Monitoring the Hadoop cluster and deploying security.
22. Slide 22
How it Works?
LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work
Verifiable Certificate