Hdfs 2016-hadoop-summit-san-jose-v4

The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.

Published in: Software

  1. HDFS: Optimization, Stabilization and Supportability
     June 28, 2016
     Chris Nauroth, email: cnauroth@hortonworks.com, twitter: @cnauroth
     Arpit Agarwal, email: aagarwal@hortonworks.com, twitter: @aagarw
  2. About Us
     © Hortonworks Inc. 2011. Architecting the Future of Big Data
     Chris Nauroth
     • Member of Technical Staff, Hortonworks
       – Apache Hadoop committer, PMC member, and Apache Software Foundation member
       – Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
     • Hadoop user since 2010
       – Prior employment experience deploying, maintaining and using Hadoop clusters
     Arpit Agarwal
     • Member of Technical Staff, Hortonworks
       – Apache Hadoop Committer, PMC Member
       – Major contributor to HDFS Heterogeneous Storage Support, Windows Compatibility
  3. Motivation
     • HDFS engineers are on the front line for operational support of Hadoop.
       – HDFS is the foundational storage layer for typical Hadoop deployments.
       – Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
       – Conversely, application problems can become visible at the layer of HDFS operations.
     • Analysis of Hadoop Support Cases
       – Support case trends reveal common patterns for HDFS operational challenges.
       – Those challenges inform what needs to improve in the software.
     • Software Improvements
       – Optimization: Identify and mitigate bottlenecks.
       – Stabilization: Prevent unusual circumstances from harming cluster uptime.
       – Supportability: When something goes wrong, provide visibility and tools to fix it.
     Thank you to the entire community of Apache contributors.
  4. Performance
     • Garbage Collection
       – NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
       – Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are larger, so the memory footprint for tracking block state has increased.)
       – Much has been written about garbage collection tuning for large heap JVM processes.
       – In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure.
  5. Performance
     • Block Reporting
       – The process by which DataNodes report information about their stored blocks to the NameNode.
       – Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.
       – Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
       – All block reporting occurs asynchronously from user-facing operations, so it does not impact end user latency directly.
       – However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user operations adequately.
  6. HDFS-7435: PB encoding of block reports is very inefficient
     • Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
       – HDFS RPC messages are encoded using Protocol Buffers.
       – Block reports encode each block as a sequence of 3 64-bit long fields.
       – Behind the scenes, this becomes an ArrayList<Long> with a default capacity of 10.
       – DataNodes almost always send a larger block report than this, so array reallocation churn is almost guaranteed.
       – Boxing and unboxing causes additional allocation requirements.
     • Solution: a more GC-friendly encoding of block reports.
       – Take over serialization directly.
       – Manually encode number of longs, followed by list of primitive longs.
       – Eliminates ArrayList reallocation costs.
       – Eliminates boxing and unboxing costs by deserializing straight to primitive long.
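The "number of longs, followed by list of primitive longs" layout can be sketched in plain Java. This is a minimal illustration, not the HDFS-7435 patch: the class name `BlockReportEncodingSketch`, the fixed-width `ByteBuffer` layout and the sample values are all assumptions made for the example, and the real wire format may differ.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch only; not the actual HDFS block report codec.
class BlockReportEncodingSketch {

    // Layout assumed for this sketch: [count of longs][long][long]...
    static ByteBuffer encode(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + values.length * Long.BYTES);
        buf.putInt(values.length);
        for (long v : values) {
            buf.putLong(v);
        }
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    static long[] decode(ByteBuffer buf) {
        int count = buf.getInt();
        long[] out = new long[count];   // sized once up front, so no reallocation
        for (int i = 0; i < count; i++) {
            out[i] = buf.getLong();     // read straight into a primitive, no boxing
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical blocks, three longs each.
        long[] report = {1001L, 134217728L, 7L, 1002L, 67108864L, 7L};
        System.out.println(Arrays.toString(decode(encode(report))));
    }
}
```

Because the count is known before any value is read, the receiver allocates exactly one array, which is the property the slide attributes to the fix.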
  7. HDFS-9710: Change DN to send block receipt IBRs in batches
     • Incremental block reports trigger multiple RPC calls.
       – When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
       – Even multiple block receipts translate to multiple individual incremental block report RPCs.
       – With consideration of all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.
     • Solution: batch multiple block receipt events into a single RPC message.
       – Reduces RPC overhead of sending multiple messages.
       – Scales better with respect to number of nodes and number of blocks in a cluster.
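The batching idea can be shown with a small stand-alone sketch. The class `IbrBatcherSketch` and its methods are hypothetical names for illustration; the DataNode's real reporting code is not reproduced here, and the "RPC" is only a counter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; not the DataNode's actual IBR implementation.
class IbrBatcherSketch {
    private final List<Long> pending = new ArrayList<>();
    private int rpcCount = 0;

    // Record a received block instead of sending an RPC right away.
    synchronized void blockReceived(long blockId) {
        pending.add(blockId);
    }

    // Called periodically: everything queued so far goes out as one message.
    synchronized List<Long> flush() {
        if (pending.isEmpty()) {
            return Collections.emptyList(); // nothing to report, so no RPC
        }
        List<Long> batch = new ArrayList<>(pending);
        pending.clear();
        rpcCount++; // stands in for a single RPC to the NameNode
        return batch;
    }

    synchronized int rpcCount() {
        return rpcCount;
    }
}
```

In this sketch, three block receipts followed by one `flush()` cost one simulated RPC rather than three.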
  8. Liveness
     • "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections..." -Wikipedia
     • DataNode Heartbeats
       – Responsible for reporting health of a DataNode to the NameNode.
       – Operational problems of managing load and performance can block timely heartbeat processing.
       – Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).
     • Blocked heartbeat processing can cause cascading failure and downtime.
       – Blocked heartbeat processing: looks the same as DataNode not running at all.
       – DataNodes not running: flagged by the NameNode as stale, then dead.
       – Multiple stale DataNodes: reduced cluster capacity.
       – Multiple dead DataNodes: storm of wasteful re-replication activity.
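The stale-then-dead progression depends only on how long it has been since the NameNode last processed a heartbeat, which is why a blocked heartbeat is indistinguishable from a stopped DataNode. A minimal sketch, assuming hypothetical names and caller-supplied thresholds (this is not the NameNode's own code):

```java
// Illustrative sketch only; thresholds are parameters, not Hadoop defaults.
class DataNodeLivenessSketch {
    enum State { LIVE, STALE, DEAD }

    // Classify a DataNode purely by the age of its last processed heartbeat.
    static State classify(long msSinceLastHeartbeat, long staleAfterMs, long deadAfterMs) {
        if (msSinceLastHeartbeat >= deadAfterMs) {
            return State.DEAD;   // would trigger re-replication of its blocks
        }
        if (msSinceLastHeartbeat >= staleAfterMs) {
            return State.STALE;  // still counted, but avoided where possible
        }
        return State.LIVE;
    }

    public static void main(String[] args) {
        long staleAfterMs = 30_000L;   // example value chosen for this sketch
        long deadAfterMs = 600_000L;   // example value chosen for this sketch
        System.out.println(classify(5_000L, staleAfterMs, deadAfterMs));
        System.out.println(classify(45_000L, staleAfterMs, deadAfterMs));
        System.out.println(classify(900_000L, staleAfterMs, deadAfterMs));
    }
}
```

Note that the function has no input describing whether the DataNode process is actually running; only the heartbeat age matters.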
  9. HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health
     • The lifeline keeps the DataNode alive, despite conditions of unusually high load.
       – Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by DataNodes.
       – Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
       – Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
       – Prevents erroneous and costly re-replication activity.
  10. HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server.
      • RPC offload of HA health check and failover messages.
        – Similar to problem of timely heartbeat message delivery.
        – NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
        – Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
        – A NameNode overwhelmed with unusually high load cannot process these messages.
        – Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
        – The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the case of unusually high load.
  11. Optimizing Applications
      • HDFS Utilization Patterns
        – Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
        – FileSystem API unfortunately can make it too easy to implement inefficient call patterns.
  12. HIVE-10223: Consolidate several redundant FileSystem API calls.
      • Hadoop FileSystem API can cause applications to make redundant RPC calls.
      • Before:
          if (fs.isFile(file)) {                // RPC #1
            ...
          } else if (fs.isDirectory(file)) {    // RPC #2
            ...
          }
      • After:
          FileStatus fileStatus = fs.getFileStatus(file);  // Just 1 RPC
          if (fileStatus.isFile()) {                       // Local, no RPC
            ...
          } else if (fileStatus.isDirectory()) {           // Local, no RPC
            ...
          }
      • Good for Hive, because it reduces latency associated with NameNode RPCs.
      • Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
  13. PIG-4442: Eliminate redundant RPC call to get file information in HPath.
      • A similar story of redundant RPC within Pig code.
      • Before:
          long blockSize = fs.getHFS().getFileStatus(path).getBlockSize();      // RPC #1
          short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
      • After:
          FileStatus fileStatus = fs.getHFS().getFileStatus(path);  // Just 1 RPC
          long blockSize = fileStatus.getBlockSize();               // Local, no RPC
          short replication = fileStatus.getReplication();          // Local, no RPC
      • Revealed from inspection of HDFS audit log.
        – HDFS audit log shows a record of each file system operation executed against the NameNode.
        – This continues to be one of the most significant sources of HDFS troubleshooting information.
        – In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.
  14. Managing NameNode Load
      • NameNode no longer a single point of failure
        – However, NameNode performance can still be a bottleneck.
      • Assumption that applications will be well-behaved
      • A single inefficient job can easily overwhelm the NameNode with too much RPC load.
  15. Hadoop RPC Architecture
      • Hadoop RPC admits incoming calls into a shared queue.
      • Worker threads consume incoming calls from that shared queue and process them.
      • In an overloaded situation, calls spend longer waiting in the queue for a worker thread to become available.
      • If the RPC queue overflows, requests are queued in the OS socket buffers.
        – More buffering leads to higher RPC latencies and potentially client side timeouts.
        – Timeouts often result in job failures and restarts.
        – Restarted jobs cause more work: a positive feedback loop.
      • Affects all callers, not just the caller that triggered the unusually high load.
  16. HADOOP-10597: RPC Server signals backoff to clients when all request queues are full
      • If an RPC server’s queue is full, respond to new requests with a backoff signal.
      • Clients react by performing exponential backoff before retrying the call.
        – Reduce job failures by avoiding client timeouts.
      • Improves QoS for clients when server is under heavy load.
      • RPC calls that would have timed out will instead succeed, but with longer latency.
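Both halves of this exchange can be sketched with the standard library: a bounded queue that refuses new calls when full, and a client-side delay that doubles per retry. The names `RpcBackoffSketch`, `admit` and `backoffMillis`, and the cap on the delay, are assumptions for illustration and are not Hadoop's RPC classes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only; not Hadoop's ipc.Server or its retry policies.
class RpcBackoffSketch {

    // Server side: offer() returns false when the bounded queue is full,
    // which this sketch treats as the backoff signal.
    static boolean admit(BlockingQueue<String> callQueue, String call) {
        return callQueue.offer(call);
    }

    // Client side: delay before retry number `attempt` (0-based, non-negative),
    // doubling each time and capped at maxMs.
    static long backoffMillis(int attempt, long baseMs, long maxMs) {
        long delay = baseMs << Math.min(attempt, 20); // bound the shift to avoid overflow
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        System.out.println(admit(queue, "getListing"));   // queue has room
        System.out.println(admit(queue, "getFileInfo"));  // queue is full: back off
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println(backoffMillis(attempt, 100L, 5_000L));
        }
    }
}
```

The point of the design is that a rejected caller learns about the overload immediately instead of discovering it through a timeout.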
  17. HADOOP-10282: FairCallQueue
      • Replace single RPC queue with multiple prioritized queues.
      • Server maintains sliding window of RPC request counts, by user.
      • New RPC calls placed into queues with priority based on the calling user’s history.
      • Calls are de-queued and processed with higher probability from higher-priority queues.
      • De-prioritizes heavy users under high load, prevents starvation of other jobs.
      • Complements RPC Congestion Control.
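A toy version of the multi-queue idea is sketched below. It is a simplification under stated assumptions, not Hadoop's FairCallQueue: the class name is hypothetical, per-user counts never decay (there is no sliding window), priority is derived from each user's share of all calls seen so far, and dequeueing uses a deterministic weighted round-robin in place of a probabilistic choice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only; not Hadoop's FairCallQueue implementation.
class FairCallQueueSketch {
    private final List<Queue<String>> queues = new ArrayList<>();
    private final Map<String, Integer> callsByUser = new HashMap<>();
    private final int[] weights;   // calls served per round at each priority level
    private int totalCalls = 0;
    private int level = 0;         // level currently being served
    private int servedAtLevel = 0; // calls served at that level this round

    FairCallQueueSketch(int[] weights) {
        this.weights = weights.clone();
        for (int i = 0; i < weights.length; i++) {
            queues.add(new ArrayDeque<>());
        }
    }

    // Heavier users map to a higher index, which means lower priority.
    int priorityFor(String user) {
        if (totalCalls == 0) {
            return 0;
        }
        double share = callsByUser.getOrDefault(user, 0) / (double) totalCalls;
        return Math.min((int) (share * queues.size()), queues.size() - 1);
    }

    void add(String user, String call) {
        int priority = priorityFor(user);          // based on history before this call
        callsByUser.merge(user, 1, Integer::sum);
        totalCalls++;
        queues.get(priority).add(call);
    }

    // Weighted round-robin: serve up to weights[level] calls, then move on.
    String poll() {
        for (int tried = 0; tried <= queues.size(); tried++) {
            Queue<String> q = queues.get(level);
            if (!q.isEmpty() && servedAtLevel < weights[level]) {
                servedAtLevel++;
                return q.poll();
            }
            level = (level + 1) % queues.size();
            servedAtLevel = 0;
        }
        return null; // every queue is empty
    }
}
```

With weights such as `{2, 1}`, the top queue gets two turns for every one turn of the lower queue, so a heavy user is slowed down but never starved.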
  18. HADOOP-12916: Allow RPC scheduler/CallQueue backoff using response times
      • Flexible back-off policies.
        – Triggering backoff when the queue is full is often too late.
        – Clients may be already experiencing timeouts before the RPC queue overflows.
      • Instead, track call response time and trigger backoff when response time exceeds bounds.
      • Further reduces the probability of client timeouts and hence reduces job failures.
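One way to picture a response-time trigger is a running average compared against a threshold. The sketch below assumes an exponentially weighted moving average and hypothetical names (`ResponseTimeBackoffSketch`, `alpha`, `thresholdMs`); the source does not say how Hadoop computes or bounds response time, so treat this purely as an illustration of the idea.

```java
// Illustrative sketch only; the averaging scheme is an assumption.
class ResponseTimeBackoffSketch {
    private final double alpha;        // weight given to the newest sample (0..1)
    private final double thresholdMs;  // average response time that triggers backoff
    private double averageMs = 0.0;
    private boolean seeded = false;

    ResponseTimeBackoffSketch(double alpha, double thresholdMs) {
        this.alpha = alpha;
        this.thresholdMs = thresholdMs;
    }

    // Fold one observed call response time into the moving average.
    void record(double responseMs) {
        averageMs = seeded ? alpha * responseMs + (1 - alpha) * averageMs : responseMs;
        seeded = true;
    }

    // True once calls are, on average, slower than the configured bound,
    // even if the call queue still has free slots.
    boolean shouldBackoff() {
        return seeded && averageMs > thresholdMs;
    }

    double averageMs() {
        return averageMs;
    }
}
```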
  19. HADOOP-13128: Manage Hadoop RPC resource usage via resource coupon (proposed feature)
      • Multi-tenancy is a key challenge in large enterprise deployments.
      • Allows HDFS and the YARN ResourceManager to coordinate allocation of RPC resources to multiple applications running concurrently in a multi-tenant deployment.
      • FairCallQueue can lead to priority inversion
        – NameNode is not aware of relative priorities of YARN jobs.
        – Requests from a high priority application can be demoted to a lower-priority RPC call queue.
        – Resource coupon presented by incoming RPC requests.
      • Allow the ResourceManager to request a slice of NameNode capacity via a coupon.
  20. Logging
      • Logging requires a careful balance.
      • Too much logging causes
        – Information overload
        – Increased system load: rendering strings is expensive and creates garbage
      • Too little logging hides valuable operational information.
  21. Too much logging
      • Benign errors can confuse administrators
        – INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 32 on 8021, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 192.168.22.1:60216 Call#9371 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
        – ERROR datanode.DataNode (DataXceiver.java:run(278)) - myhost.hortonworks.com:50010:DataXceiver error processing unknown operation src: /127.0.0.1:60681 dst: /127.0.0.1:50010 java.io.EOFException
  22. Logging Pitfalls
      • Forgotten guard logic. Without the guard, the message string is built even when debug logging is disabled.
          if (LOG.isDebugEnabled()) {
            LOG.debug("Processing block: " + block); // expensive toString() implementation!
          }
      • Switching the logging API to SLF4J can eliminate the need for log-level guards in most cases.
          LOG.debug("Processing block: {}", block); // calls toString() only if debug enabled
      • Logging in a tight loop.
      • Logging while holding a shared resource, such as a mutually exclusive lock.
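The cost of an unguarded log call can be observed with the JDK alone. The sketch below swaps in `java.util.logging` and its `Supplier`-based `fine(...)` overload as a stand-in for SLF4J's `{}` placeholders, so that it runs without extra dependencies; the `Block` class and the call counter are hypothetical and exist only to make the deferred `toString()` visible.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only; uses java.util.logging instead of SLF4J.
class LazyLoggingSketch {
    static int toStringCalls = 0;

    // Hypothetical stand-in for an object with an expensive toString().
    static class Block {
        @Override
        public String toString() {
            toStringCalls++;
            return "blk_1001";
        }
    }

    // Returns {toString() calls for eager logging, calls for deferred logging}.
    static int[] demo() {
        Logger log = Logger.getLogger("LazyLoggingSketch");
        log.setLevel(Level.INFO);  // FINE (debug-level) messages are disabled
        Block block = new Block();

        toStringCalls = 0;
        log.fine("Processing block: " + block);        // string is built before the level check
        int eager = toStringCalls;

        toStringCalls = 0;
        log.fine(() -> "Processing block: " + block);  // supplier runs only if FINE is enabled
        int deferred = toStringCalls;

        return new int[] {eager, deferred};
    }

    public static void main(String[] args) {
        int[] calls = demo();
        System.out.println("eager=" + calls[0] + " deferred=" + calls[1]);
    }
}
```

The eager form pays for `toString()` even though nothing is logged, which is the same waste the explicit `isDebugEnabled()` guard or a parameterized message avoids.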
  23. HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds
      • Logging is too verbose
        – Summary of patch: don’t log too much!
        – Move detailed logging to debug or trace level.
      • Before:
          LOG.info("BLOCK* processOverReplicatedBlock: " +
              "Postponing processing of over-replicated " +
              block + " since storage + " + storage +
              "datanode " + cur + " does not yet have up-to-date " +
              "block information.");
      • After:
          LOG.trace("BLOCK* processOverReplicatedBlock: Postponing {}" +
              " since storage {} does not yet have up-to-date information.",
              block, storage);
  24. Troubleshooting
      • Metrics are vital for diagnosis of most operational problems.
        – Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
        – Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)
  25. HDFS-6982: nntop
      • Find activity trends of HDFS operations.
        – HDFS audit log contains a record of each file system operation to the NameNode.
            2015-11-16 21:00:00,109 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/192.168.1.5 cmd=listStatus src=/app-logs/pcd_batch/application_1431545431771/ dst=null perm=null
        – However, identifying sources of load from the audit log requires ad-hoc scripting.
      • nntop: HDFS operation counts aggregated per operation and per user within time windows.
        – TopUserOpCounts: default time windows of 1 minute, 5 minutes, 25 minutes
        – curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
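The kind of ad-hoc audit log scripting that nntop replaces might look like the sketch below. It assumes only the `ugi=` and `cmd=` fields visible in the sample line above; the class name and the output key format are hypothetical, and real audit logs may need more careful parsing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; a stand-in for ad-hoc audit log scripting.
class AuditLogCountSketch {
    private static final Pattern UGI = Pattern.compile("\\bugi=(\\S+)");
    private static final Pattern CMD = Pattern.compile("\\bcmd=(\\S+)");

    // Count operations per "user cmd" pair; lines missing either field are skipped.
    static Map<String, Integer> countByUserAndCmd(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher user = UGI.matcher(line);
            Matcher cmd = CMD.matcher(line);
            if (user.find() && cmd.find()) {
                counts.merge(user.group(1) + " " + cmd.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the shape of the audit record shown above.
        List<String> lines = Arrays.asList(
            "2015-11-16 21:00:00,109 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/192.168.1.5 cmd=listStatus src=/tmp/a dst=null perm=null",
            "2015-11-16 21:00:00,110 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/192.168.1.5 cmd=listStatus src=/tmp/b dst=null perm=null",
            "2015-11-16 21:00:00,111 INFO FSNamesystem.audit: allowed=true ugi=alice (auth:SIMPLE) ip=/192.168.1.6 cmd=getfileinfo src=/tmp/a dst=null perm=null");
        countByUserAndCmd(lines).forEach((key, count) -> System.out.println(key + " " + count));
    }
}
```

Grouping on `src=` as well would surface the repeated-path pattern described for PIG-4442.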
  26. nnTop sample Output
      "windowLenMs": 60000,
      "ops": [
        {
          "opType": "create",
          "topUsers": [
            { "user": "alice@EXAMPLE.COM", "count": 4632 },
            { "user": "bob@EXAMPLE.COM", "count": 1387 }
          ],
          "totalCount": 6019
        }
        ...
  27. Troubleshooting Kerberos
      • Kerberos is hard.
        – Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
        – Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
        – When it doesn’t work, finding root cause is challenging.
  28. HADOOP-12426: kdiag
      • Kerberos misconfiguration diagnosis.
        – DNS
        – Hadoop configuration files
        – KDC configuration
      • kdiag: a command-line tool for diagnosis of Kerberos problems
        – Prints various environment variables, Java system properties and Hadoop configuration options related to security.
        – Attempts a login.
        – If a keytab is used, prints principal information from the keytab.
        – Prints krb5.conf.
        – Validates the kinit executable (used for ticket renewals).
  29. kdiag Sample Output - misconfigured DNS
      [hdfs@c6401 ~]$ hadoop org.apache.hadoop.security.KDiag
      == Kerberos Diagnostics scan at Mon Jun 27 23:13:40 UTC 2016 ==
      16/06/27 23:13:40 ERROR security.KDiag: java.net.UnknownHostException: java.net.UnknownHostException: c6401.ambari.apache.org: c6401.ambari.apache.org: unknown error
        at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
        at org.apache.hadoop.security.KDiag.execute(KDiag.java:266)
        at org.apache.hadoop.security.KDiag.run(KDiag.java:221)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.security.KDiag.exec(KDiag.java:926)
        at org.apache.hadoop.security.KDiag.main(KDiag.java:936)
      ...
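The stack trace above fails inside `InetAddress.getLocalHost()`, and that one lookup can be reproduced in isolation. This is a minimal sketch with a hypothetical class name; it covers only the local hostname resolution step and none of kdiag's other checks.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch only; reproduces just the local hostname lookup.
class LocalHostCheckSketch {

    // Returns a one-line description of whether the local hostname resolves.
    static String check() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return "OK: " + local.getHostName() + " -> " + local.getHostAddress();
        } catch (UnknownHostException e) {
            // Same exception type as in the kdiag output above.
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check());
    }
}
```

Running it on the affected host is a quick way to separate a DNS or hosts-file problem from a Kerberos configuration problem.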
  30. Summary
      • A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.
      • Optimization
        – Performance
        – Optimizing Applications
      • Stabilization
        – Liveness
        – Managing Load
      • Supportability
        – Logging
        – Troubleshooting
  31. Thank you! Q&A
      • A few recommended best practices while we address questions…
        – Enable HDFS audit logs and periodically monitor audit logs/nnTop for unexpected patterns.
        – Configure service heap settings correctly.
          https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html
        – Use dedicated disks for NN metadata directories/journal node directories.
          http://hortonworks.com/blog/hdfs-metadata-directories-explained/
        – Run balancer (and soon disk-balancer) periodically.
          http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
        – Monitor for LDAP group lookup performance issues.
          https://community.hortonworks.com/content/kbentry/38591/hadoop-and-ldap-usage-load-patterns-and-tuning.html
        – Use SmartSense for proactive analysis of potential issues and recommended fixes.
          http://hortonworks.com/products/subscriptions/smartsense/