Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
In this talk we discuss the ORC (Optimized Row Columnar) file format and the features and performance optimizations that went in after its initial version (Hive 0.11, back in May 2013). We will also briefly cover the latest and greatest features and the future enhancements planned for Hive 0.15.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
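Query vectorization, mentioned above, means the engine evaluates expressions over batches of column values instead of one row at a time. The following is a toy Python sketch of that idea only; it is not Hive's implementation, and the batch size and data are made up:

```python
# Toy illustration of vectorized execution: evaluate a filter + aggregate
# over column batches instead of row-at-a-time.

def row_at_a_time(rows):
    # One iteration's worth of per-row interpretation overhead per row.
    total = 0
    for price, qty in rows:
        if qty > 2:
            total += price * qty
    return total

def vectorized(price_col, qty_col, batch_size=1024):
    # Operate on whole column slices ("vectors") per iteration,
    # amortizing per-row overhead across the batch.
    total = 0
    for i in range(0, len(price_col), batch_size):
        p = price_col[i:i + batch_size]
        q = qty_col[i:i + batch_size]
        total += sum(pv * qv for pv, qv in zip(p, q) if qv > 2)
    return total

rows = [(10, 1), (20, 3), (30, 5)]
assert row_at_a_time(rows) == vectorized([10, 20, 30], [1, 3, 5]) == 210
```

Hive's vectorized operators process batches of on the order of a thousand rows per call, which amortizes per-row overhead in the same way the batched loop above does.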
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ... (DataWorks Summit)
Using the latest advancements from TensorFlow, including the Accelerated Linear Algebra (XLA) framework, the JIT/AOT compiler, and the Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow models in a GPU-based production environment.
This talk contains many Spark ML and TensorFlow AI demos using PipelineIO's 100% open source Community Edition. All code and Docker images are available so you can reproduce the demos on your own CPU- or GPU-based cluster.
* Bio *
Chris Fregly is Founder and Research Engineer at PipelineIO, a streaming machine learning and artificial intelligence startup based in San Francisco. He is also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly video series High Performance TensorFlow in Production.
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings - but also in more traditional, on-premises deployments - applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports (a)synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
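The "mount" idea above can be sketched as a prefix-based mount table that maps an HDFS path prefix to an external store. The paths, URIs, and function below are hypothetical illustrations, not the HDFS-9806 API:

```python
# Toy mount-table resolution: a path under a mount point is served by the
# remote store; everything else stays on the local HDFS namespace.
MOUNTS = {
    "/data/remote": "wasb://container@account.blob.core.windows.net/archive",
}

def resolve(path):
    """Return (store, relative_path) for a path, defaulting to local HDFS."""
    for mount_point, remote in MOUNTS.items():
        if path == mount_point or path.startswith(mount_point + "/"):
            return remote, path[len(mount_point):].lstrip("/")
    return "hdfs://local", path.lstrip("/")

store, rel = resolve("/data/remote/2017/part-0000.orc")
assert store.startswith("wasb://")
assert rel == "2017/part-0000.orc"
assert resolve("/user/alice/file")[0] == "hdfs://local"
```

In the actual feature, blocks fetched through such a mount can additionally be cached locally as ordinary HDFS blocks, which is what makes the remote data transparent to applications.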
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and active development is ongoing at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Hadoop has proven to be an invaluable tool for many companies over the past few years. Yet it has its ways, and knowing them up front can save valuable time. This session is a rundown of the ever-recurring lessons learned from running various Hadoop clusters in production since version 0.15.
What to expect from Hadoop - and what not? How to integrate Hadoop into existing infrastructure? Which data formats to use? What compression? Small files vs big files? Append or not? Essential configuration and operations tips. What about querying all the data? The project, the community and pointers to interesting projects that complement the Hadoop experience.
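On the small-files question above: every file and block is an object in NameNode memory (a commonly cited rule of thumb is roughly 150 bytes per object, an approximation we assume here), so the same data stored as many small files costs far more metadata than as a few large files. A back-of-the-envelope sketch:

```python
# Rough NameNode metadata estimate: 1 TB stored as many small files
# versus a few large files (128 MB block size, ~150 bytes/object heuristic).
BLOCK_SIZE = 128 * 1024**2
BYTES_PER_OBJECT = 150  # rule-of-thumb, not an exact figure

def namenode_bytes(total_bytes, file_size):
    files = total_bytes // file_size
    blocks_per_file = -(-file_size // BLOCK_SIZE)  # ceiling division
    objects = files * (1 + blocks_per_file)        # 1 file object + its blocks
    return objects * BYTES_PER_OBJECT

one_tb = 1024**4
small = namenode_bytes(one_tb, 1 * 1024**2)     # 1 MB files
large = namenode_bytes(one_tb, 1024 * 1024**2)  # 1 GB files
assert small > 100 * large  # small files cost orders of magnitude more metadata
```

The exact byte counts vary by version and features in use, but the ratio is the point: file count, not data volume, is what pressures the NameNode heap.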
Hadoop & cloud storage: object store integration in production (Chris Nauroth)
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Storage, and GCS, which are designed for cost-efficiency, scalability, and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance, and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the object store FileSystem connectors, Hive, Tez, and ORC. Our goal is to improve correctness, performance, security, and operations for users who choose to integrate Hadoop with cloud storage. We use S3 and the S3A connector as a case study.
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop (DataWorks Summit)
In this talk we introduce a new shuffle handler for Tez, a YARN auxiliary service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. Based on our experience running Apache Pig and Apache Hive at scale on Apache Tez at Yahoo!, advanced features like auto-parallelism and session mode expose specific limitations in a shuffle service that was not designed with these features in mind.
A highly auto-reduced job suffers from longer fetch times, as the number of fetches per downstream task increases by the auto-reduction factor. The Apache Tez Shuffle Handler adds composite fetch, with support for multi-partition fetch, to mitigate this slowdown.
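The arithmetic behind that slowdown, and what composite fetch buys back, can be shown with a toy model (the counts below are illustrative, not measurements):

```python
# With auto-reduction, each surviving reducer takes over the partitions of
# the reducers merged into it, multiplying the number of fetches it issues.
def fetches_per_reducer(num_mappers, auto_reduction_factor, composite=False):
    if composite:
        # Composite (multi-partition) fetch: one fetch per upstream mapper
        # can carry all of the merged partitions at once.
        return num_mappers
    # Legacy shuffle: one fetch per (mapper, partition) pair.
    return num_mappers * auto_reduction_factor

assert fetches_per_reducer(1000, 10) == 10000                 # legacy
assert fetches_per_reducer(1000, 10, composite=True) == 1000  # composite
```

So a 10x auto-reduction that would otherwise multiply fetch traffic 10x is flattened back to one fetch per upstream task.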
Also, since Apache Tez DAGs are run completely within a single application unlike their equivalent MapReduce jobs, intermediate shuffle data in Tez can linger beyond its usefulness. The Apache Tez Shuffle Handler provides deletion APIs to reduce disk usage for such long running Tez sessions.
As this is an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real-world jobs at scale.
Speakers: Kevin O'Dell, Aleksandr Shulman & Kathleen Ting (Cloudera)
From supporting the 0.90.x, 0.92, 0.94, and 0.96 HBase installations on clusters ranging from tens to hundreds of nodes, Cloudera has seen it all. Having automated the upgrade paths from the different Apache releases, we have developed a smooth path that can help the community with upcoming upgrades. In addition to automation best practices, in this talk you'll also learn proactive configuration tweaks and operational best practices to keep your HBase cluster always up and running. We'll also walk through how to contain an application bug let loose in production, how to minimize the impact on HBase posed by faulty hardware, and the direct correlation between inefficient schema design and HBase performance.
Keep your Hadoop cluster at its best! (Chris Nauroth)
Hadoop has become a backbone of many enterprises. While it can do wonders for businesses, it can sometimes be overwhelming for its operators and users. Both amateurs and seasoned operators of Hadoop are caught unaware by common pitfalls of deploying, tuning and operating a Hadoop cluster. Having spent 5+ years working with hundreds of Hadoop users, running clusters with thousands of nodes, managing tens of petabytes of data and running hundreds of thousands of tasks per day, we have seen how unintentional acts, suboptimal configurations and common mistakes result in downtimes, SLA violations, many hours of recovery operations and, in some cases, even data loss! Most of these traumas could have been easily avoided by applying easy-to-follow best practices that protect data and optimize performance. In this talk we present real-life stories, common pitfalls and, most importantly, strategies for correctly deploying and managing Hadoop clusters. The talk will empower users and help make their Hadoop journey more fulfilling and rewarding. We will also discuss SmartSense, which can identify latent problems in a cluster and provide recommendations so that an operator can fix them before they manifest as a service degradation or outage.
HDFS Tiered Storage: Mounting Object Stores in HDFS (DataWorks Summit)
This deck presents the best practices of using Apache Hive with good performance. It covers getting data into Hive, using ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive Bucketing and reading Hive Explain query plans.
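On the partition-layout point: directories keyed by common query predicates let the engine skip whole partitions at planning time. A toy Python sketch of Hive-style dt= partition pruning (illustrative only, not Hive's planner; the partition names and files are made up):

```python
# Toy partition pruning: with data laid out under dt=YYYY-MM-DD directories,
# a filter on dt only has to read files in the matching partitions.
partitions = {
    "dt=2017-06-01": ["part-0000.orc", "part-0001.orc"],
    "dt=2017-06-02": ["part-0000.orc"],
    "dt=2017-06-03": ["part-0000.orc", "part-0001.orc"],
}

def prune(partitions, wanted_dates):
    wanted = {"dt=" + d for d in wanted_dates}
    return {k: v for k, v in partitions.items() if k in wanted}

selected = prune(partitions, ["2017-06-02"])
assert list(selected) == ["dt=2017-06-02"]  # 1 of 3 partitions scanned
```

The practical takeaway is the same as in the deck: choose partition keys that match the filters your queries actually use, so most of the data never gets read.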
Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003 and extends it to handle new formats like JSON, Parquet and ORC, as well as the usual CSV, TSV, XML and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by being able to handle schema changes on the fly, enabling a whole new world of self-service and data agility never seen before.
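"Schema changes on the fly" amounts to schema-on-read: fields are discovered per record and reconciled at query time rather than declared up front. A toy Python illustration of that idea (not Drill's implementation; the records are made up):

```python
import json

# Records with drifting schemas, as might appear in a JSON file over time.
raw = """
{"id": 1, "name": "a"}
{"id": 2, "name": "b", "city": "NYC"}
{"id": 3, "tags": ["x", "y"]}
"""

records = [json.loads(line) for line in raw.strip().splitlines()]

# Union the observed fields and project every record onto that schema,
# filling missing columns with None - no upfront schema declaration needed.
schema = sorted({key for rec in records for key in rec})
table = [{col: rec.get(col) for col in schema} for rec in records]

assert schema == ["city", "id", "name", "tags"]
assert table[0]["city"] is None
assert table[2]["tags"] == ["x", "y"]
```

A fixed-schema engine would reject the third record; a schema-on-read engine just widens the result set.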
SolrCloud: the 'search first' NoSQL database, extended deep dive (Lucene Revolution)
Presented by Mark Miller, Software Engineer, Cloudera
As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin?
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL and search ecosystem evolve? If you are interested in big data, NoSQL, distributed systems, the CAP theorem and other hype-filled terms, then this talk may be for you.
An overview of the Solr 6.2 examples, including the features they have and the challenges they present; a contrasting demonstration of a minimal viable example; and a step-by-step deconstruction of the "films" example to show which parts of the shipped examples are not actually needed.
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... (Cloudera, Inc.)
For self-service BI and exploratory analytic workloads, the cloud can provide a number of key benefits, but the move to the cloud isn’t all-or-nothing. Gartner predicts nearly 80 percent of businesses will adopt a hybrid strategy. Learn how a modern analytic database can power your business-critical workloads across multi-cloud and hybrid environments, while maintaining data portability. We'll also discuss how to best leverage the increased agility cloud provides, while maintaining peak performance.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
The current major release, Hadoop 2.0, offers several significant HDFS improvements, including the new append pipeline, federation, wire compatibility, NameNode HA, snapshots, and performance improvements. We describe how to take advantage of these new features and their benefits, and cover some architectural improvements in detail, such as HA, federation and snapshots. The second half of the talk describes the features under development for the next HDFS release. This includes much-needed data management features such as backup and disaster recovery. We are adding support for different classes of storage devices such as SSDs, and open interfaces such as NFS; together these extend HDFS into a more general storage system. Hadoop has recently been extended to run first-class on Windows, which expands its enterprise reach and allows integration with the rich tool set available on Windows. As with every release, we will continue improvements to the performance, diagnosability and manageability of HDFS. To conclude, we discuss reliability, the state of HDFS adoption, and some of the misconceptions and myths about HDFS.
Interactive Hadoop via Flash and Memory (Chris Nauroth)
Enterprises are using Hadoop for interactive real-time data processing via projects such as the Stinger Initiative. We describe two new HDFS features – Centralized Cache Management and Heterogeneous Storage – that allow applications to effectively use low latency storage media such as solid state disks and RAM. In the first part of this talk, we discuss Centralized Cache Management to coordinate caching important datasets and place tasks for memory locality. HDFS deployments today rely on the OS buffer cache to keep data in RAM for faster access. However, the user has no direct control over what data is held in RAM or how long it's going to stay there. Centralized Cache Management allows users to specify which data to lock into RAM. Next, we describe Heterogeneous Storage support, which lets applications choose storage media based on their performance and durability requirements. Perhaps the most interesting of the newer storage media are solid state drives, which provide improved random I/O performance over spinning disks. We also discuss memory as a storage tier, which can be useful for temporary files and intermediate data in latency-sensitive real-time applications. In the last part of the talk, we describe how administrators can use quota mechanism extensions to manage fair distribution of scarce storage resources across users and applications.
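The difference from the OS buffer cache is control: the user names what must stay in RAM. A toy model of pinned caching under a capacity budget follows; the class, paths and sizes are hypothetical illustrations, not the HDFS caching API:

```python
# Toy centralized cache: explicit pin directives under a RAM budget.
# Unlike an OS buffer cache, pinned data is never silently evicted;
# instead, pins that exceed the budget are rejected up front.
class CachePool:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.pinned = {}  # path -> size in bytes

    def add_directive(self, path, size):
        used = sum(self.pinned.values())
        if used + size > self.capacity:
            raise MemoryError("cannot pin %s: pool full" % path)
        self.pinned[path] = size

    def is_cached(self, path):
        return path in self.pinned

pool = CachePool(capacity_bytes=10 * 1024**3)            # 10 GB pool
pool.add_directive("/warehouse/dim_customers", 4 * 1024**3)
assert pool.is_cached("/warehouse/dim_customers")
try:
    pool.add_directive("/warehouse/fact_sales", 8 * 1024**3)  # exceeds budget
except MemoryError:
    pass  # rejected rather than evicting the pinned dataset
```

The quota extensions mentioned at the end of the abstract play a similar role one level up, bounding how much of the scarce cache each user or application may pin.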
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Still All on One Server: Perforce at Scale (Perforce)
Google runs the busiest single Perforce server on the planet, and one of the largest repositories in any source control system. This session will address server performance and other issues of scale, as well as where Google is in general, how it got there and how it continues to stay ahead of its users.
Best Practices for Virtualizing Apache Hadoop (Hortonworks)
Join this webinar to discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of virtual machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotion, increased stabilization of the entire software stack, high availability and fault tolerance. This is a can't-miss presentation for anyone wanting to understand the design, configuration and deployment of Hadoop in virtual infrastructures.
Best and Worst Practices Deploying IBM Connections (LetsConnect)
Depending on deployment size, operating system and security considerations, you have different options for configuring IBM Connections. This session will show examples from multiple customer deployments of IBM Connections. I will describe things I found and how you can optimize your systems. Main topics include: simple (documented) tasks that should be applied, missing documentation, automated user synchronization, TDI solutions for user synchronization, performance tuning, security optimization and planning single sign-on.
Hadoop Online Training: Kelly Technologies is one of the best Hadoop online training institutes in Bangalore, providing Hadoop online training by real-time faculty.
Fundamentals of Big Data, Hadoop project design, and a case study or use case.
General planning considerations and essentials for the Hadoop ecosystem and Hadoop projects.
This will provide the basis for choosing the right Hadoop implementation, integrating and adopting Hadoop technologies, and creating an infrastructure.
Building applications using Apache Hadoop with a use-case of WI-FI log analysis has real life example.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfJay Das
With the advent of artificial intelligence or AI tools, project management processes are undergoing a transformative shift. By using tools like ChatGPT, and Bard organizations can empower their leaders and managers to plan, execute, and monitor projects more effectively.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. It’s here, custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
First Steps with Globus Compute Multi-User Endpoints
Hadoop operations-2015-hadoop-summit-san-jose-v5
1. Hadoop Operations – Best Practices from the Field
June 11, 2015
Chris Nauroth (email: cnauroth@hortonworks.com, twitter: @cnauroth)
Suresh Srinivas (email: suresh@hortonworks.com, twitter: @suresh_m_s)
13. HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Install & Configure: Ambari Guided Configuration
Guides configuration and provides recommendations for the most common settings. (HBase example shown here)
36. New System to Manage the Health of Hadoop Clusters
• Ambari Alerts are installed and configured by default
• Health Alerts and Metrics managed via Ambari Web
First, a quick introduction. My name is Chris Nauroth. I’m a software engineer on the HDFS team at Hortonworks. I’m an Apache Hadoop committer and PMC member. I’m also an Apache Software Foundation member. Some of my major contributions include HDFS ACLs, Windows compatibility and various operability improvements.
Prior to Hortonworks, I worked for Disney and did an initial deployment of Hadoop there. As part of that job, I worked very closely with the systems engineering team responsible for maintaining those Hadoop clusters, so I tend to think back to that team and get excited about things I can do now as a software engineer to help make that team’s job easier.
I’m also here with Suresh Srinivas, one of the founders of Hortonworks, and a long-time Hadoop committer and PMC member. He has a lot of experience supporting some of the world’s largest clusters at Yahoo and elsewhere. Together with Suresh, we have experience supporting Hadoop clusters since 2008.
For today’s agenda, I’d like to start by sharing some analysis that we’ve done of support case trends. In that analysis, we’re going to see that some common patterns emerge, and that’s going to lead into a discussion of configuration best practices and software improvements.
In the second half of the talk, we’ll move into a discussion of key learnings and best practices around how recent HDFS features can help prevent problems or manage day-to-day maintenance.
Let’s dive into the support case analysis. The data source for this chart is the entire history of support cases at Hortonworks. The x-axis is the month and the y-axis is the proportion of support cases reported against a specific component. The chart focuses on 3 components that we define as the core of Hadoop: HDFS, YARN and MapReduce. All other components in the ecosystem are collapsed into a single line. Here we see a trend stabilizing around 30% of support cases driven from those core components. It also makes sense intuitively that a large proportion of support cases are driven from those core components, because every deployment uses them. As you rise up the stack, deployments start to vary in the components they choose to deploy. For example, a deployment may or may not deploy HBase depending on its use cases.
The second chart shows an analysis of root cause category in each of those 3 core components. The source data contains many additional root cause categories. I’ve chosen to prune this down to the most significant ones to simplify the chart. The pattern that we see here is that a lot of support cases are driven by configuration issues or documentation problems. On an interesting side note, I gave a version of this presentation last year at Strata, and since then I’ve refreshed these charts with current data. Something I noticed is that documentation, configuration and software defects are proportionally a little bit smaller than last time. We’ve been investing a lot of energy in these areas, so it was satisfying to see the data showing that those efforts have been somewhat successful.
Investment in operations at the core helps the most users.
With that, let’s move into a discussion of common configuration issues that we continue to see.
A cluster with fewer, denser nodes is less resilient than one with many smaller nodes. Failure of a DataNode that carries more storage causes more re-replication activity, and MapReduce jobs may need to rerun more tasks. Remember that commodity hardware does not mean poor-quality hardware.
Compressed ordinary object pointers are a technique used in the JVM to represent managed pointers as 32-bit offsets from a 64-bit base heap address, saving the space taken by full 64-bit native pointers. We used to recommend passing a JVM argument to turn this on; recent JVM versions enable it by default. Setting -Xmx different from -Xms can cause a big, expensive malloc late in the process lifetime, with surprising results if the machine has run out of memory by then. N=8 typically. Also keep the Linux oom-killer in mind, since it may terminate a memory-hungry daemon when the system runs out of memory.
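As a sketch of the heap advice above, the daemon JVM options can be pinned in hadoop-env.sh; the 4 GB figure here is purely illustrative and must be sized to your own namespace:

```shell
# hadoop-env.sh (illustrative sizing, not a recommendation)
# Setting -Xms equal to -Xmx allocates the full heap up front,
# avoiding a large, surprising malloc late in the process lifetime.
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g ${HADOOP_NAMENODE_OPTS}"
```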
NameNode high availability was a very hot topic a few years ago. At this point, the recommended HA architecture is to use QuorumJournalManager, which sets up an active-standby pair of NameNodes and offloads edit logging to a separate set of daemons called the JournalNodes.
On a side note, version control for configuration is a good thing. It can be helpful to look back on the history of changes or restore to a last known good state.
The DataNode has a feature called disk-fail-in-place that allows it to keep running even if individual volumes have failed. This is off by default, but you can turn it on by editing hdfs-site.xml and setting property dfs.datanode.failed.volumes.tolerated to the number of volumes that you tolerate failing before shutting down the entire DataNode. This is useful for large-density nodes, meaning nodes that have a lot of disks. If you have a node with 16 disks, and 2 disks fail, you’d probably prefer to keep that DataNode running with 14 disks available to serve clients instead of shutting down the whole thing.
dfs.namenode.name.dir.restore is a property that controls whether or not the NameNode should attempt to bring back into service metadata storage directories that previously failed. By turning this on, you have the ability to repair a failed directory online and bring it back into service without restarting the NameNode process.
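Both of the properties above are set in hdfs-site.xml. A minimal sketch, with illustrative values:

```xml
<!-- hdfs-site.xml (illustrative values) -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <!-- keep the DataNode running until more than 2 volumes have failed -->
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.name.dir.restore</name>
  <!-- attempt to bring previously failed name dirs back into service -->
  <value>true</value>
</property>
```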
We recommend taking periodic backups of the NameNode metadata. Copy the entire storage directory.
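Copying the entire storage directory is the primary approach; as a complementary hedged sketch, Hadoop 2.x also offers `hdfs dfsadmin -fetchImage`, which downloads the most recent fsimage from the NameNode (the backup path below is hypothetical):

```shell
# Sketch: fetch the latest fsimage into a dated backup directory.
# Note: this captures the fsimage only, not the edit logs, so it
# complements rather than replaces a full storage directory copy.
BACKUP_DIR=/backups/nn-metadata/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
hdfs dfsadmin -fetchImage "$BACKUP_DIR"
```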
Also plan on reserving a lot of disk for NameNode logs. A common pitfall is choosing too little space for logs, which then forces you to configure Log4J to roll logs very rapidly, and this can make debugging harder.
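A hedged sketch of capping log growth in Hadoop's log4j.properties (the RFA appender name matches stock Hadoop distributions; the sizes are illustrative and should be generous when disk allows):

```properties
# Roll at 256 MB and keep 20 rolled files: roughly 5 GB per daemon log.
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.MaxFileSize=256MB
log4j.appender.RFA.MaxBackupIndex=20
```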
Something to keep in mind that usage patterns on a cluster tend to change over time as use cases change. Configuration may need to change in reaction to changing usage patterns. If you have a major upgrade or maintenance planned, then that’s a good opportunity to review configurations and see if anything else needs to change.
Increasingly, we’re pushing configuration best practices into the implementation of Ambari. This takes the burden off of administrators to remember these best practices during deployments. For those who don’t know, Apache Ambari is an open source cluster deployment and management tool. For a little variety, I chose to pull a screenshot related to HBase. Here we can see that Ambari starts by recommending some good defaults, but still gives administrators the option to tune settings to match their specific needs.
Next, I’d like to discuss a few software improvements that were prompted by our experiences in support cases. We’ve found that often very small code changes can have a big impact on preventing problems or recovering from them. I’m going to discuss some real incidents that we’ve seen and how they led us to make those code changes.
First, a public service announcement: don’t edit the metadata files. The NameNode metadata files are crucial for maintaining the state of the file system, so editing them can corrupt cluster state and result in loss of data. Don’t edit them.
Now that I’ve said that, let’s talk about editing the metadata files. This is a real incident. A NameNode was misconfigured to point to the metadata from a different NameNode. An important note here is that part of the NameNode metadata is a namespace ID, which uniquely identifies that file system namespace. When DataNodes register with a NameNode for the first time, they also acquire that namespace ID and persist it locally. On subsequent DataNode restarts, the NameNode has a check that the DataNode attempting to register with it is presenting the same namespace ID. After NameNode restart, the DataNodes could not register with the NameNode because of the namespace ID mismatch. The system detected the problem correctly, and so far everything is working as designed. However, the admin thought an appropriate fix would be to manually edit the VERSION file, which is the part of the metadata containing the namespace ID, and change it to match what the DataNodes were reporting.
“What happens next?”
The problem is that the NameNode’s fsimage also persists the block IDs that are known for each file. When these DataNodes from a different cluster started sending their block reports, the NameNode replied by saying these blocks do not exist in my namespace, and therefore they should be deleted.
This is the HDFS web UI, now with a small enhancement to show the time when block deletions will start.
HDFS is known for being a scalable system. One of the things it’s really awesome at is scaling deletes! This can be a scary situation if someone deletes the wrong thing, because attempting to recover by undeleting block files is error-prone and time-consuming work across all DataNodes.
We recommend enabling the HDFS trash feature as a safety net, which essentially changes deletes into renames, and the NameNode can then reap the trash files at a later time.
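Trash is enabled with a single property in core-site.xml; a sketch with a 24-hour retention (the value is in minutes):

```xml
<property>
  <name>fs.trash.interval</name>
  <!-- keep deleted files in trash for 24 hours before reaping -->
  <value>1440</value>
</property>
```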
However, I’m going to talk about a real incident in which trash was not enabled. There was a large directory deleted, and the admin realized this was a mistake and chose to shut down the NameNode immediately. The support engineer taking the case naturally figured we could restore from trash, so advised restarting the NameNode.
“What happens next?”
This incident really points out the importance of protecting data against accidental deletion. HDFS snapshots and HDFS ACLs are two features that I think help with this. I’ll have more coverage of these features later in the presentation.
“What happens next?”
If you’ve used POSIX ACLs on a Linux file system, then you already know how it works in HDFS too.
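For example, the HDFS shell mirrors the familiar POSIX commands; the user and path here are hypothetical:

```shell
# Grant one extra user read/execute access without widening group permissions.
hdfs dfs -setfacl -m user:diana:r-x /sales/reports
# Inspect the resulting ACL entries.
hdfs dfs -getfacl /sales/reports
```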
By convention, snapshots can be referenced as a file system path under sub-directory “.snapshot”.
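A short hedged sketch of the snapshot workflow (the directory and file names are hypothetical):

```shell
# An administrator first allows snapshots on the directory...
hdfs dfsadmin -allowSnapshot /sales/reports
# ...then an authorized user can create a named snapshot
# and read files back through the .snapshot path.
hdfs dfs -createSnapshot /sales/reports before-cleanup
hdfs dfs -cat /sales/reports/.snapshot/before-cleanup/q2.csv
```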
Here is a screenshot pointing out a change in the HDFS web UI: Total Datanode Volume Failures is a hyperlink. Clicking that jumps to…
…this new screen listing the volume failures in detail. We can see the path of each failed storage location and an estimate of the capacity that was lost. I think of this screen as a to-do list for a systems engineer performing regular cluster maintenance.
Here is what it looks like when there are no volume failures. I included this picture, because this is what we all want it to look like. Of course, it won’t always be that way.