Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Hadoop Admin Best Practices with HDP 2.3
Part-2

 We have INSTRUCTOR LED - both Online LIVE & Classroom Session
 Present for classroom sessions in Bangalore & Delhi (NCR)
 We are the ONLY Education delivery partners for Mulesoft, Elastic, Pivotal & Lightbend in India
 We have delivered more than 5000 trainings and have over 400 courses and a vast pool of over 200 experts to make YOU the
EXPERT!
FOLLOW US ON SOCIAL MEDIA TO STAY UPDATED ON THE UPCOMING WEBINARS

Online and Classroom Training on Technology Courses
at SpringPeople
Certified Partners
Non-Certified Courses
…and many more
…NOW

Covered Till Now
1. Use Ambari – Cluster Management Tool
2. More of WebHDFS Access
3. WebHDFS
4. Use More of HDFS Access Control Lists
5. Use HDFS Quotas
6. Understanding of YARN Components
7. Adding, Deleting, or Replacing Worker Nodes
8. Rack Awareness
9. NameNode High Availability
10. ResourceManager High Availability
11. Ambari Metrics System
12. What to Backup?

13 – Setting appropriate Directory Space Quota
• Best practice is to also set space limits on home directory To set a
12TB limit:
$ hdfs dfsadmin –setSpaceQuota 12t /user/username
• Includes space for replication
• This is the actual use of space
• Example:
• If storing 1TB and replication factor is 3
• 3TB is needed
• Quota can be set on any directory

14 - Configuring Trash
• Enable by setting time delay for trash's checkpoint removal:
In core-site.xml
• fs.trash.interval
• Delay is set in minutes (24 hours would be 1440 minutes)
• Recommendation is to set to 360 minutes (6 hours)
• Setting the value to 0 disables Trash
• Files deleted programmatically are deleted immediately
• Files can be immediately deleted from the command line using -skipTrash

15 - Compression Needs and Tradeoffs
 Compressing data can speed up data-intensive I/O operations
• MapReduce jobs are almost always I/O bound
 Compressed data can save storage space and speed up data transfers across the network
• Capital allocation for hardware can go further
 Reduced I/O and network load can result in significant performance improvements
• MapReduce jobs can finish faster overall
 But, CPU utilization and processing time increase during compression and decompression
• Understanding the tradeoffs is important for MapReduce pipeline’s overall performance

16 - Sqoop Security
• Database Authentication:
• Sqoop needs to authenticate to the RDBMS
• How?
• Usually this involves a username/password
(Oracle Wallet is the exception)
• Can hard code password in scripts (not recommended/used)
• Password usually stored in plaintext in a file protected by the filesystem
• Hadoop Credential Management Framework added in HDP 2.2
• Not a keystore, but a way to interface with keystore backends
• Passwords can be stored in a keystore and not in plain text
• Can help with “no passwords in plaintext” requirements

17 - distcp Configurations
• If Distcp runs out of memory before copying:
• Possible Cause: Number of files/directories being copied from source
path(s) is extremely large (e.g. 100,000 paths)
• Change: HEAP Size
- Export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m”
• Map Sizing
• If -m is not specified: Default to 20 maps max
• Tuning the number of maps to:
- Size of the source and destination cluster
- The size of the copy
- Available bandwidth

18 - Falcon
 Centrally manages data lifecycle
• Centralized definition & management of pipelines for data ingest,
process and export
 Supports Business continuity and Disaster
Recovery
• Out of the box policies for data replication
and retention
• End-to-end monitoring of data pipelines
 Addresses basic audit & compliance requirements
• Visualize data pipeline lineage
• Track data pipeline audit logs
• Tag data with business metadata

19 - Running Balancer
• Can be run periodically as a batch job
• Examples: every 24 hours or weekly
• Run after new nodes have been added to the cluster
• To run balancer:
hdfs balancer [-threshold <threshold>] [-policy <policy>]]
• Runs until there are no blocks to move
or
Until it has lost contact with the NameNode
• Can be stopped with a Ctrl+C

20 - HDFS Snapshots
Create HDFS directory snapshots
Fast operation - only metadata affected
Results in .snapshot/ directory in the HDFS directory
Snapshots are named or default to timestamp
Directories must be made snapshottable
Snapshot Steps:
– Allow snapshot on directory
hdfs dfsadmin -allowSnapshot foo/bar/
– Create snapshot for directory and optionally provide snapshot name
hdfs dfs -createSnapshot foo/bar/ mysnapshot_today
– Verify snapshot
hdfs dfs -ls foo/bar/.snapshot

21 - HDFS Data – Automate & Restore
• Use Falcon/Oozie to automate backups
• Falcon utilizes Oozie as a workflow scheduler
• distcp is an Oozie action
- use -update and -prbugp
• Restoring is the reverse process of backups
1. On your backup cluster choose which snapshot to restore
2. Remove/move target directory on production system
3. Run distcp without -update options

www.springpeople.comtraining@springpeople.com
Upcoming Hortonworks Classes at
SpringPeople
Classroom (Bengaluru)
05 - 08 Sept
26 - 28 Sept
10 - 13 Oct
07 - 10 Nov
05 - 08 Dec
19 - 21 Dec
Online LIVE
22 - 31 Aug
05 - 17 Sept
19 Sept - 01 Oct

Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Similar to Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2 (20)

More from SpringPeople

More from SpringPeople (20)

Recently uploaded

Recently uploaded (20)

Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) 2.3 _Part 2

Editor's Notes