HDFS: Optimization, Stabilization and
Supportability
April 13, 2016
Chris Nauroth
email: cnauroth@hortonworks.com
twitter: @cnauroth
© Hortonworks Inc. 2011
About Me
Chris Nauroth
• Member of Technical Staff, Hortonworks
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
– Prior employment experience deploying, maintaining and using Hadoop clusters
Page 2
Architecting the Future of Big Data
Motivation
• HDFS engineers are on the front line for operational support of Hadoop.
– HDFS is the foundational storage layer for typical Hadoop deployments.
– Therefore, challenges in HDFS have the potential to impact the entire Hadoop ecosystem.
– Conversely, application problems can become visible at the layer of HDFS operations.
• Analysis of Hadoop Support Cases
– Support case trends reveal common patterns for HDFS operational challenges.
– Those challenges inform what needs to improve in the software.
• Software Improvements
– Optimization: Identify bottlenecks and make them faster.
– Stabilization: Prevent unusual circumstances from harming cluster uptime.
– Supportability: When something goes wrong, provide visibility and tools to fix it.
Thank you to the entire community of Apache contributors.
Logging
• Logging requires a careful balance.
– Too little logging hides valuable operational information.
– Too much logging causes information overload, increased load and greater garbage collection overhead.
• Logging APIs
– Hadoop codebase currently uses a mix of logging APIs.
– Commons Logging and Log4J 1 require additional guard logic to prevent execution of expensive messages.
if (LOG.isDebugEnabled()) {
  LOG.debug("Processing block: " + block); // expensive toString() implementation!
}
– SLF4J simplifies this.
LOG.debug("Processing block: {}", block); // calls toString() only if debug enabled
• Pitfalls
– Forgotten guard logic.
– Logging in a tight loop.
– Logging while holding a shared resource, such as a mutually exclusive lock.
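The cost of a forgotten guard can be shown with a small, self-contained sketch (a toy illustration, not Hadoop code; the `Supplier`-based `debugLazy` helper is hypothetical and simply mirrors what SLF4J's `{}` placeholders achieve):

```java
import java.util.function.Supplier;

public class LazyLogDemo {
    static int toStringCalls = 0;

    static class Block {
        @Override
        public String toString() {
            toStringCalls++;               // stand-in for an expensive toString()
            return "blk_1073741825_1001";
        }
    }

    static boolean debugEnabled = false;   // debug logging is off

    // Eager: the message string is built before the level check can reject it.
    static void debugEager(String msg) {
        if (debugEnabled) System.out.println(msg);
    }

    // Lazy: the Supplier runs only when debug is actually enabled.
    static void debugLazy(Supplier<String> msg) {
        if (debugEnabled) System.out.println(msg.get());
    }

    public static void main(String[] args) {
        Block block = new Block();
        debugEager("Processing block: " + block);      // toString() runs despite debug being off
        int afterEager = toStringCalls;
        debugLazy(() -> "Processing block: " + block); // toString() never runs
        int afterLazy = toStringCalls;
        System.out.println(afterEager + " eager, " + (afterLazy - afterEager) + " lazy");
    }
}
```

Multiplied across millions of blocks in a tight loop, that wasted `toString()` work becomes real allocation and GC pressure.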
HADOOP-12318: better logging of LDAP exceptions
• Failure to log full details of an authentication failure.
– Very simple patch, huge payoff.
– Include exception details when logging failure.
• Before:
throw new SaslException("PLAIN auth failed: " + e.getMessage());
• After:
throw new SaslException("PLAIN auth failed: " + e.getMessage(), e);
HDFS-9434: Recommission a datanode with 500k blocks
may pause NN for 30 seconds
• Logging is too verbose
– Summary of patch: don’t log too much!
– Move detailed logging to trace level.
– It’s still accessible for edge case troubleshooting, but it doesn’t impact base operations.
• Before:
LOG.info("BLOCK* processOverReplicatedBlock: " +
"Postponing processing of over-replicated " +
block + " since storage + " + storage
+ "datanode " + cur + " does not yet have up-to-date " +
"block information.");
• After:
if (LOG.isTraceEnabled()) {
LOG.trace("BLOCK* processOverReplicatedBlock: Postponing " + block
+ " since storage " + storage
+ " does not yet have up-to-date information.");
}
Troubleshooting
• Kerberos is hard.
– Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
– Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
– When it doesn’t work, finding root cause is challenging.
• Metrics are vital for diagnosis of most operational problems.
– Metrics must be capable of showing that there is a problem. (e.g. RPC call volume spike)
– Metrics also must be capable of identifying the source of that problem. (e.g. user issuing RPC calls)
HADOOP-12426: kdiag
• Kerberos misconfiguration diagnosis.
– Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems.
– DNS
– Hadoop configuration files
– KDC configuration
• kdiag: a command-line tool for diagnosis of Kerberos problems
– Automatically trigger Java diagnostics, such as -Dsun.security.krb5.debug.
– Prints various environment variables, Java system properties and Hadoop configuration options related to
security.
– Attempt a login.
– If keytab used, print principal information from keytab.
– Print krb5.conf.
– Validate kinit executable (used for ticket renewals).
HDFS-6982: nntop
• Find activity trends of HDFS operations.
– HDFS audit log contains a record of each file system operation to the NameNode.
– NameNode metrics contain raw counts of operations.
– Identifying load trends from particular users or particular operations has always required ad-hoc scripting to
analyze the above sources of information.
• nntop: HDFS operation counts aggregated per operation and per user within time windows.
– curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
– Look for the “TopUserOpCounts” section in the returned JSON.
"ops": [
  {
    "totalCount": 1,
    "opType": "delete",
    "topUsers": [
      {
        "count": 1,
        "user": "chris"
      }
    ]
  }
]
HDFS-7182: JMX metrics aren't accessible when NN is
busy
• Lock contention while attempting to query NameNode JMX metrics.
– JMX metrics are often queried in response to operational problems.
– Some metrics data required acquisition of a lock inside the NameNode. If another thread held this lock, then
metrics could not be accessed.
– During times of high load, the lock is likely to be held by another thread.
– At a time when the metrics are most likely to be needed, they were inaccessible.
– This patch addressed the problem by acquiring the metrics data without requiring the lock to be held.
Managing Load
• RPC call load.
– It’s too easy for a single inefficient job to overwhelm a cluster with too much RPC load.
– RPC servers accept calls into a single shared queue.
– Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient
job that caused the problem.
– Load problems can be mitigated with enhanced admission control, client back-off and throttling policies
tailored to real-world usage patterns.
HADOOP-10282: FairCallQueue
• Hadoop RPC Architecture
– Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
– Worker threads consume the incoming calls from that shared queue and process them.
– In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
– At the extreme, the queue overflows, which then requires rejecting the calls.
– This tends to punish all callers, not just the caller that triggered the unusually high load.
• RPC Congestion Control with FairCallQueue
– Replace single shared queue with multiple prioritized queues.
– Call is placed into a queue with priority selected based on the calling user’s current history.
– Calls are dequeued and processed with greater frequency from higher-priority queues.
– Under normal operations, when the RPC server can keep up with load, this is not noticeably different from the
original architecture.
– Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other
processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
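FairCallQueue is enabled per RPC server port via configuration. A minimal sketch for a NameNode listening on the common default port 8020 (the port number is an assumption; substitute your cluster's NameNode RPC port):

```xml
<!-- core-site.xml: swap the NameNode's single shared call queue
     for FairCallQueue on RPC port 8020 (port is deployment-specific). -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
```

With this in place, calls are classified by the calling user's recent volume and placed into prioritized sub-queues, as described above.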
HADOOP-10597: RPC Server signals backoff to clients
when all request queues are full
• Client-side backoff from overloaded RPC servers.
– Builds upon work of the RPC FairCallQueue.
– If an RPC server’s queue is full, then optionally send a signal to additional incoming clients to request backoff.
– Clients are aware of the signal, and react by performing exponential backoff before sending additional calls.
– Improves quality of service for clients when server is under heavy load. RPC calls that would have failed will
instead succeed, but with longer latency.
– Improves likelihood of server recovering, because client backoff will give it more opportunity to catch up.
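The client-side reaction can be sketched as a capped exponential backoff policy. This is an illustrative model, not the actual Hadoop RPC client code; the class and parameter names are hypothetical:

```java
public class BackoffPolicy {
    private final long baseMillis;
    private final long maxMillis;

    public BackoffPolicy(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Delay before the given retry attempt (0-based): base * 2^attempt, capped. */
    public long delayFor(int attempt) {
        long delay = baseMillis << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        BackoffPolicy p = new BackoffPolicy(100, 5_000);
        for (int i = 0; i < 8; i++) {
            System.out.println("attempt " + i + " -> wait " + p.delayFor(i) + " ms");
        }
    }
}
```

Each rejected-then-retried call waits progressively longer, which is exactly what gives the overloaded server room to drain its queue.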
HADOOP-12916: Allow RPC scheduler/callqueue backoff
using response times
• More flexibility in back-off policies.
– Triggering backoff when the queue is full is in some sense too late. The problem has already grown too severe.
– Instead, track call response time, and trigger backoff when response time exceeds bounds.
– Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can
prevent the problem from becoming so severe that the queue overflows.
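A configuration sketch of response-time-based backoff, using the property names introduced by HADOOP-12916 (names and the port 8020 are assumptions; verify against your Hadoop release before use):

```xml
<!-- core-site.xml: schedule calls with DecayRpcScheduler and trigger
     client backoff when RPC response times exceed per-priority thresholds. -->
<property>
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>
<property>
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.backoff.responsetime.enable</name>
  <value>true</value>
</property>
```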
Performance
• Garbage Collection
– NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
– Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are
larger, therefore the memory footprint has increased for tracking block state.)
– Much has been written about garbage collection tuning for large heap JVM processes.
– In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage
collection pressure.
• Block Reporting
– The process by which DataNodes report information about their stored blocks to the NameNode.
– Full Block Report: a complete catalog of all of the node’s blocks, sent infrequently.
– Incremental Block Report: partial information about recently added or deleted blocks, sent more frequently.
– All block reporting occurs asynchronously from user-facing operations, so it does not impact end user latency
directly.
– However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end user
operations sufficiently.
HDFS-7097: Allow block reports to be processed during
checkpointing on standby name node
• Coarse-grained locking impedes block report processing.
– NameNode has a global lock required to enforce mutual exclusion for some operations.
– One such operation is checkpointing performed at the HA standby NameNode: process of creating a new fsimage
representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
– Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.
• Coarse-grained lock contention can lead to cascading failure and downtime.
– Checkpointing holds lock.
– Frequent incremental block reports from DataNodes block waiting to acquire lock.
– Eventually consumes all available RPC handler threads, all waiting to acquire lock.
– In extreme case, blocks HA NameNode failover, because there is no RPC handler thread available to handle the
failover request.
– Even if HA failover can succeed, may still leave cluster in a state where it appears many nodes have gone dead,
because their blocked heartbeats couldn’t be processed.
• Solution: allow block report processing without holding global lock.
– Block reports now can be processed concurrently with a checkpoint in progress.
– Like most multi-threading and locking logic, required careful reasoning to ensure change was safe.
HDFS-7435: PB encoding of block reports is very inefficient
• Block report RPC message encoding can cause memory allocation inefficiency and garbage
collection churn.
– HDFS RPC messages are encoded using Protocol Buffers.
– Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
– Behind the scenes, this becomes an ArrayList with a default capacity of 10.
– DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost
guaranteed.
– Data type contained in the ArrayList is Long (note capitalization: the boxed Long, not the primitive long).
– Boxing and unboxing causes additional allocation requirements.
• Solution: a more GC-friendly encoding of block reports.
– Within the Protocol Buffers RPC message, take over serialization directly.
– Manually encode number of longs, followed by list of primitive longs.
– Eliminates ArrayList reallocation costs.
– Eliminates boxing and unboxing costs by deserializing straight to primitive long.
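The encoding idea can be sketched as follows. This is an illustrative codec, much simpler than the actual HDFS-7435 patch (which embeds the longs inside a Protocol Buffers message); all names here are hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BlockListCodec {
    /** Encode as primitive longs with an explicit count: [count, v0, v1, ...]. */
    public static byte[] encode(long[] longs) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * (longs.length + 1));
        buf.putLong(longs.length);
        for (long v : longs) {
            buf.putLong(v);          // no boxing, no ArrayList growth/reallocation
        }
        return buf.array();
    }

    /** Decode straight into a primitive array, avoiding Long unboxing. */
    public static long[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] out = new long[(int) buf.getLong()];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getLong();
        }
        return out;
    }

    public static void main(String[] args) {
        long[] report = {1073741825L, 134217728L, 1001L}; // block id, length, genstamp
        System.out.println(Arrays.equals(report, decode(encode(report))));
    }
}
```

Because the length is written up front, the decoder can allocate one exactly-sized `long[]` instead of growing a boxed list element by element.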
HDFS-7609: Avoid retry cache collision when Standby
NameNode loading edits
• Idempotence and at-most-once delivery of HDFS RPC messages.
– Some RPC message processing is inherently idempotent: can be applied multiple times, and the final result is still
the same. Example: setPermission.
– Other messages are not inherently idempotent, but the NameNode can still provide an “at-most-once” processing
guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.
– The data structure that does this is called the RetryCache.
– This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send
the same message more than once.
• Erroneous multiple RetryCache entries for same operation.
– Duplicate entries caused slowdown.
– Particularly noticeable during an HA transition.
– Bug fix to prevent duplicate entries.
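The at-most-once mechanism behind the RetryCache can be sketched minimally (an illustrative toy, far simpler than the real implementation, which also expires entries and stores richer results):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class RetryCacheSketch {
    private final Map<Long, Boolean> completed = new HashMap<>();
    public int executions = 0;

    /** Run op at most once per callId; replay the cached result for duplicates. */
    public synchronized boolean execute(long callId, Supplier<Boolean> op) {
        Boolean prior = completed.get(callId);
        if (prior != null) {
            return prior;            // duplicate delivery: replay, do not re-run
        }
        executions++;
        boolean result = op.get();   // e.g. a non-idempotent rename
        completed.put(callId, result);
        return result;
    }

    public static void main(String[] args) {
        RetryCacheSketch cache = new RetryCacheSketch();
        cache.execute(42L, () -> true);
        cache.execute(42L, () -> true);  // retransmission after failover
        System.out.println("executions: " + cache.executions);
    }
}
```

The bug fixed by HDFS-7609 amounted to the same call ID gaining multiple cache entries, defeating this deduplication.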
HDFS-9710: Change DN to send block receipt IBRs in
batches
• Incremental block reports trigger multiple RPC calls.
– When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
– Even multiple block receipts translate to multiple individual incremental block report RPCs.
– With consideration of all DataNodes in a large cluster, this can become a huge number of RPC messages for the
NameNode to process.
• Solution: batch multiple block receipt events into a single RPC message.
– Reduces RPC overhead of sending multiple messages.
– Scales better with respect to number of nodes and number of blocks in a cluster.
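The batching idea can be sketched as follows. This is hypothetical illustration code, not the DataNode implementation (the real HDFS-9710 logic batches on a configurable time interval; this sketch batches on a size threshold for simplicity):

```java
import java.util.ArrayList;
import java.util.List;

public class IbrBatcher {
    private final int batchSize;
    private final List<Long> pending = new ArrayList<>();
    public int rpcsSent = 0;

    public IbrBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Buffer a block receipt instead of sending one RPC per block. */
    public void blockReceived(long blockId) {
        pending.add(blockId);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** One RPC carries every buffered receipt. */
    public void flush() {
        if (pending.isEmpty()) return;
        rpcsSent++;                  // stand-in for a single IBR RPC to the NameNode
        pending.clear();
    }
}
```

With a batch size of n, n block receipts that previously cost n RPCs now cost one.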
Liveness
• "...make progress despite the fact that its concurrently executing components ("processes") may
have to "take turns" in critical sections, parts of the program that cannot be simultaneously run
by multiple processes." -Wikipedia
• DataNode Heartbeats
– Responsible for reporting health of a DataNode to the NameNode.
– Operational problems of managing load and performance can block timely heartbeat processing.
– Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and
asynchronous dispatch of commands (e.g. delete block).
• Blocked heartbeat processing can cause cascading failure and downtime.
– Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and
therefore are not running.
– DataNodes that stop running are flagged by the NameNode as dead.
– Too many dead DataNodes makes the cluster inoperable as a whole.
– Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
– Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.
HDFS-9239: DataNode Lifeline Protocol: an alternative
protocol for reporting DataNode health
• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
– Optionally run a separate RPC server within the NameNode dedicated to processing of lifeline messages sent by
DataNodes.
– Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for
asynchronous command dispatch, and therefore do not need to contend on a shared lock.
– Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
– Prevents erroneous and costly re-replication activity.
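Enabling the lifeline server is a matter of giving it a bind address; if unset, the feature stays off. A minimal sketch (the host/port values are placeholders for your NameNode):

```xml
<!-- hdfs-site.xml: start a dedicated lifeline RPC server in the NameNode,
     separate from the main (potentially congested) RPC server. -->
<property>
  <name>dfs.namenode.lifeline.rpc-address</name>
  <value>namenode.example.com:8050</value>
</property>
```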
HDFS-9311: Support optional offload of NameNode HA
service health checks to a separate RPC server.
• RPC offload of HA health check and failover messages.
– Similar to problem of timely heartbeat message delivery.
– NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the
NameNode.
– Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
– A NameNode overwhelmed with unusually high load cannot process these messages.
– Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged
outage period.
– The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the
case of unusually high load.
Optimizing Applications
• HDFS Utilization Patterns
– Sometimes it’s helpful to look a layer higher and assess what applications are doing with HDFS.
– The FileSystem API unfortunately makes it too easy to implement inefficient call patterns.
HIVE-10223: Consolidate several redundant FileSystem
API calls.
• Hadoop FileSystem API can cause applications to make redundant RPC calls.
• Before:
if (fs.isFile(file)) { // RPC #1
...
} else if (fs.isDirectory(file)) { // RPC #2
...
}
• After:
FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
if (fileStatus.isFile()) { // Local, no RPC
...
} else if (fileStatus.isDirectory()) { // Local, no RPC
...
}
• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
PIG-4442: Eliminate redundant RPC call to get file
information in HPath.
• A similar story of redundant RPC within Pig code.
• Before:
long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
• After:
FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
long blockSize = fileStatus.getBlockSize(); // Local, no RPC
short replication = fileStatus.getReplication(); // Local, no RPC
• Revealed from inspection of HDFS audit log.
– HDFS audit log shows a record of each file system operation executed against the NameNode.
– This continues to be one of the most significant sources of HDFS troubleshooting information.
– In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a
Pig job submission.
HDFS-9924: Asynchronous HDFS Access
• Current Hadoop FileSystem API is inherently synchronous.
– Issue a single synchronous file system call.
– In the case of HDFS, that call is implemented with a synchronous RPC.
– Block waiting for the result.
– Then, client application may proceed.
• Some application usage patterns would benefit from asynchronous access.
– Some applications regularly issue a large sequence of multiple file system calls, with no data dependencies
between the results of those calls.
– For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename
can execute independently, with no data dependencies on the results of other renames.
public Future<Boolean> rename(Path src, Path dst) throws IOException;
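The call pattern such an API enables can be approximated today with an ExecutorService over a synchronous rename-like call. This sketch is illustrative only (the real HDFS-9924 work returns futures directly from the client, without burning a caller thread per operation); all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

public class AsyncRenameDemo {
    /** Issue all renames concurrently, then gather results; no rename
     *  depends on another's result, so they can all be in flight at once. */
    public static List<Boolean> renameAll(BiFunction<String, String, Boolean> rename,
                                          List<String[]> pairs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String[] p : pairs) {
                futures.add(pool.submit(() -> rename.apply(p[0], p[1])));
            }
            List<Boolean> results = new ArrayList<>();
            for (Future<Boolean> f : futures) {
                results.add(f.get());   // block only once, at the gather point
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

For a Hive job with thousands of independent partition renames, overlapping the RPCs this way hides most of the per-call NameNode round-trip latency.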
Summary
• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational
storage layer of the Hadoop ecosystem.
• Optimization
– Performance
– Optimizing Applications
• Stabilization
– Liveness
– Managing Load
• Supportability
– Logging
– Troubleshooting
Thank you!
Q&A
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
DataWorks Summit
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
Big Data Joe™ Rossi
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
Alfresco Software
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
pbelko82
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
Owen O'Malley
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 

Similar to Hdfs 2016-hadoop-summit-dublin-v1 (20)

Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5Hadoop operations-2014-strata-new-york-v5
Hadoop operations-2014-strata-new-york-v5
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDiscoSD Big Data Monthly Meetup #4 - Session 2 - WANDisco
SD Big Data Monthly Meetup #4 - Session 2 - WANDisco
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 

Recently uploaded

Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 

Recently uploaded (20)

Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 

Hdfs 2016-hadoop-summit-dublin-v1

• Pitfalls
  – Forgotten guard logic.
  – Logging in a tight loop.
  – Logging while holding a shared resource, such as a mutually exclusive lock.
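The cost difference between string concatenation and SLF4J-style parameterized logging can be sketched without pulling in a logging library. The `LazyLogDemo` logger and `Block` class below are hypothetical stand-ins, not the real SLF4J API:

```java
// Minimal sketch of why parameterized logging defers expensive toString()
// calls until the level check passes (the classes here are illustrative).
public class LazyLogDemo {
  static int toStringCalls = 0;
  static boolean debugEnabled = false;

  // Stand-in for a block object with an expensive toString().
  static class Block {
    @Override public String toString() {
      toStringCalls++;              // count formatting work
      return "blk_1073741825_1001";
    }
  }

  // SLF4J-style: the argument is only formatted if debug is enabled.
  static void debug(String format, Object arg) {
    if (debugEnabled) {
      System.out.println(format.replace("{}", arg.toString()));
    }
  }

  public static int callsWithDebugDisabled() {
    toStringCalls = 0;
    debugEnabled = false;
    debug("Processing block: {}", new Block()); // toString() never runs
    return toStringCalls;
  }

  public static int callsWithConcatenation() {
    toStringCalls = 0;
    debugEnabled = false;
    // Commons Logging / Log4j 1 style without a guard: toString() runs anyway.
    String msg = "Processing block: " + new Block();
    return toStringCalls;
  }
}
```

Without the guard, the concatenation pays the formatting cost even when the message is discarded; the parameterized form pays nothing.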
• 5. HADOOP-12318: better logging of LDAP exceptions
  • Failure to log full details of an authentication failure.
    – Very simple patch, huge payoff.
    – Include exception details when logging failure.
  • Before:
    throw new SaslException("PLAIN auth failed: " + e.getMessage());
  • After:
    throw new SaslException("PLAIN auth failed: " + e.getMessage(), e);
• 6. HDFS-9434: Recommission a datanode with 500k blocks may pause NN for 30 seconds
  • Logging was too verbose.
    – Summary of the patch: don’t log too much!
    – Move detailed logging to trace level.
    – It’s still accessible for edge-case troubleshooting, but it doesn’t impact normal operations.
  • Before:
    LOG.info("BLOCK* processOverReplicatedBlock: " + "Postponing processing of over-replicated " + block + " since storage + " + storage + "datanode " + cur + " does not yet have up-to-date " + "block information.");
  • After:
    if (LOG.isTraceEnabled()) {
      LOG.trace("BLOCK* processOverReplicatedBlock: Postponing " + block
          + " since storage " + storage + " does not yet have up-to-date information.");
    }
• 7. Troubleshooting
  • Kerberos is hard.
    – Many moving parts: KDC, DNS, principals, keytabs and Hadoop configuration.
    – Management tools like Apache Ambari automate initial provisioning of principals, keytabs and configuration.
    – When it doesn’t work, finding root cause is challenging.
  • Metrics are vital for diagnosis of most operational problems.
    – Metrics must be capable of showing that there is a problem (e.g. an RPC call volume spike).
    – Metrics also must be capable of identifying the source of that problem (e.g. the user issuing the RPC calls).
• 8. HADOOP-12426: kdiag
  • Kerberos misconfiguration diagnosis.
    – Attempts to diagnose multiple sources of potential Kerberos misconfiguration problems:
    – DNS
    – Hadoop configuration files
    – KDC configuration
  • kdiag: a command-line tool for diagnosis of Kerberos problems.
    – Automatically triggers Java diagnostics, such as -Dsun.security.krb5.debug.
    – Prints various environment variables, Java system properties and Hadoop configuration options related to security.
    – Attempts a login.
    – If a keytab is used, prints principal information from the keytab.
    – Prints krb5.conf.
    – Validates the kinit executable (used for ticket renewals).
• 9. HDFS-6982: nntop
  • Find activity trends of HDFS operations.
    – The HDFS audit log contains a record of each file system operation sent to the NameNode.
    – NameNode metrics contain raw counts of operations.
    – Identifying load trends from particular users or particular operations has always required ad-hoc scripting to analyze the above sources of information.
  • nntop: HDFS operation counts aggregated per operation and per user within time windows.
    – curl 'http://127.0.0.1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
    – Look for the “TopUserOpCounts” section in the returned JSON:
      "ops": [ {
          "totalCount": 1,
          "opType": "delete",
          "topUsers": [ {
              "count": 1,
              "user": "chris"
          } ]
      } ]
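As a rough illustration, per-user counts could be pulled out of that JSON with a few lines of Java. `NNTopParser` is a hypothetical helper using a toy regex over the `count`/`user` pairs; a real consumer should use a proper JSON parser:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that pulls per-user operation counts out of the
// TopUserOpCounts JSON returned by the NameNode JMX endpoint.
public class NNTopParser {
  // Matches pairs of the form: "count": N, "user": "name"
  private static final Pattern TOP_USER =
      Pattern.compile("\"count\"\\s*:\\s*(\\d+)\\s*,\\s*\"user\"\\s*:\\s*\"([^\"]+)\"");

  public static Map<String, Integer> topUsers(String json) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    Matcher m = TOP_USER.matcher(json);
    while (m.find()) {
      // Sum counts across windows/operations for the same user.
      counts.merge(m.group(2), Integer.parseInt(m.group(1)), Integer::sum);
    }
    return counts;
  }
}
```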
• 10. HDFS-7182: JMX metrics aren't accessible when NN is busy
  • Lock contention while attempting to query NameNode JMX metrics.
    – JMX metrics are often queried in response to operational problems.
    – Some metrics data required acquiring a lock inside the NameNode. If another thread held this lock, then the metrics could not be accessed.
    – During times of high load, the lock is likely to be held by another thread.
    – At exactly the time when the metrics are most likely to be needed, they were inaccessible.
    – This patch addressed the problem by acquiring the metrics data without requiring the lock to be held.
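The shape of that fix can be sketched as follows. `LockFreeMetrics` is a hypothetical simplification, not the actual NameNode code: mutations still take the big lock, but the metric itself lives in an atomic counter that a JMX read can fetch without ever waiting on that lock.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the idea: serve metrics from lock-free counters so a JMX read
// never has to wait on the (possibly busy) global namesystem lock.
public class LockFreeMetrics {
  private final ReentrantLock namesystemLock = new ReentrantLock();
  private final AtomicLong filesTotal = new AtomicLong();

  public void createFile() {
    namesystemLock.lock();           // mutations still take the big lock
    try {
      filesTotal.incrementAndGet();  // metric updated inside the critical section
    } finally {
      namesystemLock.unlock();
    }
  }

  // Metrics read: no lock acquisition, so it succeeds even under heavy load.
  public long getFilesTotal() {
    return filesTotal.get();
  }
}
```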
• 11. Managing Load
  • RPC call load.
    – It’s too easy for a single inefficient job to overwhelm a cluster with too much RPC load.
    – RPC servers accept calls into a single shared queue.
    – Overflowing that queue causes increased latency and rejection of calls for all callers, not just the single inefficient job that caused the problem.
    – Load problems can be mitigated with enhanced admission control, client back-off and throttling policies tailored to real-world usage patterns.
• 12. HADOOP-10282: FairCallQueue
  • Hadoop RPC architecture
    – Traditionally, Hadoop RPC internally admits incoming RPC calls into a single shared queue.
    – Worker threads consume the incoming calls from that shared queue and process them.
    – In an overloaded situation, calls spend more time waiting in the queue for a worker thread to become available.
    – At the extreme, the queue overflows, which then requires rejecting the calls.
    – This tends to punish all callers, not just the caller that triggered the unusually high load.
  • RPC congestion control with FairCallQueue
    – Replace the single shared queue with multiple prioritized queues.
    – Each call is placed into a queue with a priority selected based on the calling user’s current history.
    – Calls are dequeued and processed with greater frequency from higher-priority queues.
    – Under normal operation, when the RPC server can keep up with load, this is not noticeably different from the original architecture.
    – Under high load, this tends to deprioritize users triggering unusually high load, thus allowing room for other processes to make progress. There is less risk of a single runaway job overwhelming a cluster.
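The mechanism can be sketched with a toy model. `ToyFairCallQueue` is hypothetical: it uses an arbitrary 10-calls-per-level threshold and strict priority on dequeue, whereas the real implementation uses a decaying per-user call count and a weighted round-robin multiplexer so low-priority queues still drain.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of the FairCallQueue idea: calls from heavy users land in
// lower-priority queues; the dequeue side favors higher-priority queues.
public class ToyFairCallQueue {
  private final List<Deque<String>> queues = new ArrayList<>(); // index 0 = highest priority
  private final Map<String, Integer> recentCalls = new HashMap<>();
  private final int levels;

  public ToyFairCallQueue(int levels) {
    this.levels = levels;
    for (int i = 0; i < levels; i++) queues.add(new ArrayDeque<>());
  }

  // Priority by recent call volume: the more a user has called lately,
  // the lower the priority level its next call is placed in.
  // Returns the chosen level (useful for observing the behavior).
  public int put(String user, String call) {
    int history = recentCalls.merge(user, 1, Integer::sum);
    int level = Math.min(levels - 1, history / 10); // arbitrary 10-calls-per-level threshold
    queues.get(level).addLast(user + ":" + call);
    return level;
  }

  // Dequeue: scan from highest priority down (real code weights this instead).
  public String take() {
    for (Deque<String> q : queues) {
      if (!q.isEmpty()) return q.pollFirst();
    }
    return null;
  }
}
```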
• 13. HADOOP-10597: RPC Server signals backoff to clients when all request queues are full
  • Client-side backoff from overloaded RPC servers.
    – Builds upon the work of the RPC FairCallQueue.
    – If an RPC server’s queue is full, then optionally send a signal to additional incoming clients to request backoff.
    – Clients are aware of the signal, and react by performing exponential backoff before sending additional calls.
    – Improves quality of service for clients when the server is under heavy load. RPC calls that would have failed will instead succeed, but with longer latency.
    – Improves the likelihood of the server recovering, because client backoff gives it more opportunity to catch up.
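The client-side reaction can be sketched as a capped exponential delay schedule. `RpcBackoff` and its parameters are hypothetical, not the actual client retry policy:

```java
// Sketch of exponential backoff: the wait before each retry doubles,
// up to a cap, instead of the client hammering an overloaded server.
public class RpcBackoff {
  // Delay in milliseconds before retry number `attempt` (0-based).
  public static long delayMillis(int attempt, long baseMillis, long capMillis) {
    long delay = baseMillis << Math.min(attempt, 20); // 2^attempt growth, shift kept safe
    return Math.min(delay, capMillis);
  }
}
```

With a 100 ms base and a 10 s cap, retries wait 100 ms, 200 ms, 400 ms, … until the cap is reached.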
• 14. HADOOP-12916: Allow RPC scheduler/callqueue backoff using response times
  • More flexibility in back-off policies.
    – Triggering backoff when the queue is full is in some sense too late: the problem has already grown too severe.
    – Instead, track call response time, and trigger backoff when response time exceeds bounds.
    – Any amount of queueing increases RPC response latency. Reacting to unusually high RPC response time can prevent the problem from becoming so severe that the queue overflows.
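One way to picture this trigger is an exponentially decayed average of response times compared against a threshold. `ResponseTimeBackoff` and its parameters are a hypothetical sketch of the policy, not the Hadoop scheduler code:

```java
// Sketch: track a decayed moving average of call response times and signal
// backoff once it crosses a threshold, rather than waiting for queue overflow.
public class ResponseTimeBackoff {
  private double avgResponseMillis = 0.0;
  private final double alpha;           // weight given to the newest sample
  private final double thresholdMillis; // backoff trigger

  public ResponseTimeBackoff(double alpha, double thresholdMillis) {
    this.alpha = alpha;
    this.thresholdMillis = thresholdMillis;
  }

  // Record one completed call's response time.
  public void recordResponse(double millis) {
    avgResponseMillis = alpha * millis + (1 - alpha) * avgResponseMillis;
  }

  // Should new calls be told to back off?
  public boolean shouldBackOff() {
    return avgResponseMillis > thresholdMillis;
  }
}
```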
• 15. Performance
  • Garbage collection
    – The NameNode heap must scale up in relation to the number of file system objects (files, directories, blocks, etc.).
    – Recent hardware trends can cause larger DataNode heaps too. (Nodes have more disks and those disks are larger, so the memory footprint for tracking block state has increased.)
    – Much has been written about garbage collection tuning for large-heap JVM processes.
    – In addition to recommending configuration best practices, we can optimize the codebase to reduce garbage collection pressure.
  • Block reporting
    – The process by which DataNodes report information about their stored blocks to the NameNode.
    – Full block report: a complete catalog of all of the node’s blocks, sent infrequently.
    – Incremental block report: partial information about recently added or deleted blocks, sent more frequently.
    – All block reporting occurs asynchronously with respect to user-facing operations, so it does not impact end-user latency directly.
    – However, inefficiencies in block reporting can overwhelm a cluster to the point that it can no longer serve end-user operations sufficiently.
• 16. HDFS-7097: Allow block reports to be processed during checkpointing on standby name node
  • Coarse-grained locking impedes block report processing.
    – The NameNode has a global lock required to enforce mutual exclusion for some operations.
    – One such operation is checkpointing performed at the HA standby NameNode: the process of creating a new fsimage representing the full metadata state and beginning a new edit log. This can take a long time in large clusters.
    – Block report processing also required holding the lock, and therefore could not proceed during a checkpoint.
  • Coarse-grained lock contention can lead to cascading failure and downtime.
    – Checkpointing holds the lock.
    – Frequent incremental block reports from DataNodes block, waiting to acquire the lock.
    – Eventually this consumes all available RPC handler threads, all waiting to acquire the lock.
    – In the extreme case, this blocks HA NameNode failover, because there is no RPC handler thread available to handle the failover request.
    – Even if HA failover can succeed, it may still leave the cluster in a state where many nodes appear to have gone dead, because their blocked heartbeats couldn’t be processed.
  • Solution: allow block report processing without holding the global lock.
    – Block reports can now be processed concurrently with a checkpoint in progress.
    – Like most multi-threading and locking logic, this required careful reasoning to ensure the change was safe.
• 17. HDFS-7435: PB encoding of block reports is very inefficient
  • Block report RPC message encoding can cause memory allocation inefficiency and garbage collection churn.
    – HDFS RPC messages are encoded using Protocol Buffers.
    – Block reports encoded each block ID, length and generation stamp in a Protocol Buffers repeated long field.
    – Behind the scenes, this becomes an ArrayList with a default capacity of 10.
    – DataNodes in large clusters almost always send a larger block report than this, so ArrayList reallocation churn is almost guaranteed.
    – The data type contained in the ArrayList is Long (note the capitalization: the boxed type, not primitive long).
    – Boxing and unboxing cause additional allocation requirements.
  • Solution: a more GC-friendly encoding of block reports.
    – Within the Protocol Buffers RPC message, take over serialization directly.
    – Manually encode the number of longs, followed by the list of primitive longs.
    – Eliminates ArrayList reallocation costs.
    – Eliminates boxing and unboxing costs by deserializing straight to primitive long.
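The encoding idea, stripped of the Protocol Buffers machinery, can be sketched with plain Java streams. `BlockListCodec` is a hypothetical illustration of "count followed by primitive longs", not the actual HDFS wire format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of the GC-friendly idea: serialize a block list as a count followed
// by primitive longs (e.g. id, length, generation stamp per block), avoiding
// boxed Long objects and ArrayList growth entirely.
public class BlockListCodec {
  public static byte[] encode(long[] longs) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(longs.length);          // number of longs up front
      for (long v : longs) out.writeLong(v); // primitives only, no boxing
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);   // in-memory streams shouldn't fail
    }
  }

  public static long[] decode(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      long[] longs = new long[in.readInt()]; // exact allocation, no reallocation churn
      for (int i = 0; i < longs.length; i++) longs[i] = in.readLong(); // straight to primitive
      return longs;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```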
  • 18. © Hortonworks Inc. 2011
HDFS-7609: Avoid retry cache collision when Standby NameNode loading edits
• Idempotence and at-most-once delivery of HDFS RPC messages.
– Some RPC message processing is inherently idempotent: it can be applied multiple times, and the final result is still the same. Example: setPermission.
– Other messages are not inherently idempotent, but the NameNode can still provide an "at-most-once" processing guarantee by temporarily tracking recently executed operations by a unique call ID. Example: rename.
– The data structure that does this is called the RetryCache.
– This is important in failure modes, such as an HA failover or a network partition, which may cause a client to send the same message more than once.
• Erroneous multiple RetryCache entries for the same operation.
– Duplicate entries caused slowdown.
– Particularly noticeable during an HA transition.
– Bug fix to prevent duplicate entries.
Page 18
Architecting the Future of Big Data
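A minimal sketch of the at-most-once idea behind the RetryCache. This is hypothetical code, not the actual HDFS implementation (which also handles entry expiration, in-progress calls, and payload caching): a non-idempotent operation executes only on its first delivery, and a retried delivery with the same call ID returns the cached outcome instead of re-executing.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of at-most-once semantics keyed by a unique (clientId, callId) string. */
public class RetryCacheSketch {
    private final Map<String, Boolean> completed = new HashMap<>();

    /**
     * Run a non-idempotent operation at most once per call ID.
     * A retried delivery skips re-execution and returns the cached result.
     */
    public synchronized boolean execute(String callId, Runnable op) {
        Boolean prior = completed.get(callId);
        if (prior != null) {
            return prior;            // duplicate delivery: return the cached outcome
        }
        op.run();                    // first delivery: actually execute
        completed.put(callId, true);
        return true;
    }
}
```

Without this cache, a client that resends `rename(a, b)` after a network timeout could fail spuriously: the first delivery already moved the file, so the retry finds the source missing. The HDFS-7609 bug was that the same operation could erroneously get multiple cache entries, which slowed processing, especially during an HA transition.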
  • 19. © Hortonworks Inc. 2011
HDFS-9710: Change DN to send block receipt IBRs in batches
• Incremental block reports trigger multiple RPC calls.
– When a DataNode receives a block, it sends an incremental block report RPC to the NameNode immediately.
– Even multiple block receipts translate to multiple individual incremental block report RPCs.
– Considering all DataNodes in a large cluster, this can become a huge number of RPC messages for the NameNode to process.
• Solution: batch multiple block receipt events into a single RPC message.
– Reduces the RPC overhead of sending multiple messages.
– Scales better with respect to the number of nodes and number of blocks in a cluster.
Page 19
Architecting the Future of Big Data
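The batching pattern can be sketched as follows (a simplified, hypothetical model of the DataNode side, not actual HDFS code): block-received events are queued rather than sent immediately, and a periodic flush carries the whole accumulated batch in one report message.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: accumulate block-received events and flush them as one batched report. */
public class IbrBatcher {
    private final List<String> pending = new ArrayList<>();
    public int reportsSent = 0;  // how many report messages were "sent"

    /** Called when a block finishes being received; queues instead of sending an RPC. */
    public void blockReceived(String blockId) {
        pending.add(blockId);
    }

    /** Called periodically (or when the batch grows large enough); one message per batch. */
    public List<String> flush() {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        if (!batch.isEmpty()) {
            reportsSent++;       // a single message now carries every queued event
        }
        return batch;
    }
}
```

In the unbatched scheme, three block receipts would mean three separate RPCs; here they cost one. The trade-off is a small added latency (bounded by the flush interval) before the NameNode learns of a new replica.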
  • 20. © Hortonworks Inc. 2011
Liveness
• "...make progress despite the fact that its concurrently executing components ("processes") may have to "take turns" in critical sections, parts of the program that cannot be simultaneously run by multiple processes." -Wikipedia
• DataNode Heartbeats
– Responsible for reporting the health of a DataNode to the NameNode.
– Operational problems of managing load and performance can block timely heartbeat processing.
– Heartbeat processing at the NameNode can be surprisingly costly due to contention on a global lock and asynchronous dispatch of commands (e.g. delete block).
• Blocked heartbeat processing can cause cascading failure and downtime.
– Blocked heartbeat processing can make the NameNode think DataNodes are not heartbeating at all, and therefore are not running.
– DataNodes that stop running are flagged by the NameNode as dead.
– Too many dead DataNodes makes the cluster inoperable as a whole.
– Dead DataNodes must have their replicas copied to other DataNodes to satisfy replication requirements.
– Erroneously flagging DataNodes as dead can cause a storm of wasteful re-replication activity.
Page 20
Architecting the Future of Big Data
  • 21. © Hortonworks Inc. 2011
HDFS-9239: DataNode Lifeline Protocol: an alternative protocol for reporting DataNode health
• The lifeline keeps the DataNode alive, despite conditions of unusually high load.
– Optionally run a separate RPC server within the NameNode dedicated to processing lifeline messages sent by DataNodes.
– Lifeline messages are a simplified form of heartbeat messages, but do not have the same costly requirements for asynchronous command dispatch, and therefore do not need to contend on a shared lock.
– Even if the main NameNode RPC queue is overwhelmed, the lifeline still keeps the DataNode alive.
– Prevents erroneous and costly re-replication activity.
Page 21
Architecting the Future of Big Data
  • 22. © Hortonworks Inc. 2011
HDFS-9311: Support optional offload of NameNode HA service health checks to a separate RPC server.
• RPC offload of HA health check and failover messages.
– Similar to the problem of timely heartbeat message delivery.
– NameNode HA requires messages sent from the ZKFC (ZooKeeper Failover Controller) process to the NameNode.
– Messages are related to handling periodic health checks and initiating shutdown and failover if necessary.
– A NameNode overwhelmed with unusually high load cannot process these messages.
– Delayed processing of these messages slows down NameNode failover, and thus creates a visibly prolonged outage period.
– The lifeline RPC server can be used to offload HA messages, and similarly keep processing them even in the case of unusually high load.
Page 22
Architecting the Future of Big Data
  • 23. © Hortonworks Inc. 2011
Optimizing Applications
• HDFS Utilization Patterns
– Sometimes it's helpful to look a layer higher and assess what applications are doing with HDFS.
– The FileSystem API unfortunately can make it too easy to implement inefficient call patterns.
Page 23
Architecting the Future of Big Data
  • 24. © Hortonworks Inc. 2011
HIVE-10223: Consolidate several redundant FileSystem API calls.
• The Hadoop FileSystem API can cause applications to make redundant RPC calls.
• Before:
if (fs.isFile(file)) { // RPC #1
  ...
} else if (fs.isDirectory(file)) { // RPC #2
  ...
}
• After:
FileStatus fileStatus = fs.getFileStatus(file); // Just 1 RPC
if (fileStatus.isFile()) { // Local, no RPC
  ...
} else if (fileStatus.isDirectory()) { // Local, no RPC
  ...
}
• Good for Hive, because it reduces latency associated with NameNode RPCs.
• Good for the whole ecosystem, because it reduces load on the NameNode, a shared service.
Page 24
Architecting the Future of Big Data
  • 25. © Hortonworks Inc. 2011
PIG-4442: Eliminate redundant RPC call to get file information in HPath.
• A similar story of redundant RPC within Pig code.
• Before:
long blockSize = fs.getHFS().getFileStatus(path).getBlockSize(); // RPC #1
short replication = fs.getHFS().getFileStatus(path).getReplication(); // RPC #2
• After:
FileStatus fileStatus = fs.getHFS().getFileStatus(path); // Just 1 RPC
long blockSize = fileStatus.getBlockSize(); // Local, no RPC
short replication = fileStatus.getReplication(); // Local, no RPC
• Revealed from inspection of the HDFS audit log.
– The HDFS audit log shows a record of each file system operation executed against the NameNode.
– This continues to be one of the most significant sources of HDFS troubleshooting information.
– In this case, manual inspection revealed a suspicious pattern of multiple getfileinfo calls for the same path from a Pig job submission.
Page 25
Architecting the Future of Big Data
  • 26. © Hortonworks Inc. 2011
HDFS-9924: Asynchronous HDFS Access
• The current Hadoop FileSystem API is inherently synchronous.
– Issue a single synchronous file system call.
– In the case of HDFS, that call is implemented with a synchronous RPC.
– Block waiting for the result.
– Then, the client application may proceed.
• Some application usage patterns would benefit from asynchronous access.
– Some applications regularly issue a large sequence of multiple file system calls, with no data dependencies between the results of those calls.
– For example, Hive partition logic can involve hundreds or thousands of rename operations, where each rename can execute independently, with no data dependencies on the results of other renames.
public Future<Boolean> rename(Path src, Path dst) throws IOException;
Page 26
Architecting the Future of Big Data
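The `Future<Boolean> rename(...)` signature above is the proposed API shape. As a rough sketch of the call pattern an application would want, here is how independent renames could be overlapped using `CompletableFuture` over a thread pool. The `rename` stub below is a stand-in for a FileSystem RPC, and all names are invented for the illustration; this is not the HDFS-9924 implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncRenameSketch {
    /** Issue many independent renames concurrently, then collect all results. */
    public static List<Boolean> renameAll(List<String[]> srcDstPairs) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<CompletableFuture<Boolean>> futures = new ArrayList<>();
            for (String[] pair : srcDstPairs) {
                // Each rename is independent, so overlap the RPC round trips.
                futures.add(CompletableFuture.supplyAsync(() -> rename(pair[0], pair[1]), pool));
            }
            List<Boolean> results = new ArrayList<>();
            for (CompletableFuture<Boolean> f : futures) {
                results.add(f.join());  // block only after every call is already in flight
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for a synchronous FileSystem#rename RPC. */
    static boolean rename(String src, String dst) {
        return true;
    }
}
```

With a synchronous API, a thousand renames cost a thousand sequential round trips; issuing them concurrently lets the total latency approach that of the slowest call plus queueing, which is the motivation for a truly asynchronous client API rather than this thread-pool workaround.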
  • 27. © Hortonworks Inc. 2011
Summary
• A variety of recent enhancements have improved the ability of HDFS to serve as the foundational storage layer of the Hadoop ecosystem.
• Optimization
– Performance
– Optimizing Applications
• Stabilization
– Liveness
– Managing Load
• Supportability
– Logging
– Troubleshooting
Page 27
Architecting the Future of Big Data
  • 28. © Hortonworks Inc. 2011 Thank you! Q&A

Editor's Notes

  1. Thank Arpit.
  2. We’ll look at specific Apache JIRA issues, some not yet shipped, some still in progress. Small patches often yield big wins. Sometimes those patches are even small enough to fit on a PowerPoint slide, as you’re about to see. Some are larger.
  3. These are common challenges for any large Java codebase, not just specific to Hadoop.
  4. Too little logging. Size of code change: 3 characters. Without this extra logging information, diagnosis is very challenging.
  5. Too much logging.
  6. Kerberos is notorious for obtuse error messages that don’t directly point out root cause.
  7. These are often steps we need to follow in any case that requires Kerberos troubleshooting. Codifying these steps into a standard tool makes gathering this information easier and more consistent.
  8. Helps find the naughty user who is overwhelming your cluster.
  9. “smoothing”
  10. In contrast to managing an overloaded situation, how can we more effectively handle more load?
  11. Garbage collection friendly data structures are particularly relevant to the NameNode, which has a large heap size requirement.
  12. Data structure not efficient for duplicate entries. (Not the use case.)
  13. We’ve talked about how HDFS can better react to overloaded conditions, and we’ve talked about improving HDFS to handle more total load. What is the source of that load? Is it legitimate?
  14. I encourage you to explore and analyze the HDFS audit log in your clusters.
  15. Improving the API to encourage more efficient applications.
  16. Performance of HDFS itself and also optimizing applications.