The document discusses availability and reliability in distributed systems. It describes that for a system to be truly reliable, it must be fault-tolerant, highly available, recoverable, consistent, scalable, have predictable performance, and be secure. It then discusses how the namenode is a single point of failure in Hadoop, and describes various approaches to improve availability through replicating metadata and using secondary or backup nodes.
Introduction to Hadoop High Availability (Omid Vahdaty)
Understand how to create a highly available Hadoop cluster: active/passive, with manual failover. Includes links to help you get started, what to focus on, common mistakes, etc.
2. Reliability in distributed systems
•To be truly reliable, a distributed system must have the following characteristics:
–Fault-Tolerant: It can recover from component failures without performing incorrect actions.
–Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
–Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
–Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
–Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
–Predictable Performance: The ability to provide desired responsiveness in a timely manner.
–Secure: The system authenticates access to data and services
3. SPOF
•The combination of
–replicating namenode metadata on multiple filesystems, and
–using the secondary namenode to create checkpoints
•protects against data loss,
•but does not provide high availability of the filesystem.
4. SPOF
•The namenode is still a single point of failure (SPOF),
•since if it did fail,
–all clients,
–including MapReduce jobs,
•would be unable to read, write, or list files,
•because the namenode is the sole repository of
–the metadata and
–the file-to-block mapping.
5. SPOF
•In such an event, the whole Hadoop system would effectively be
•“out of service”
•until a new namenode could be brought online.
6. Reasons for downtime
•An important part of improving availability, and of articulating requirements, is understanding the causes of downtime.
•There are many types of failures in distributed systems, ways to classify them, and analyses of how failures result in downtime.
8. Hardware failures
•Hosts and their connections may fail
•Hardware failures on the master host
•or a failure in the connection between the master and the majority of the slaves
•can cause system downtime
9. Software failures
•Software bugs may cause a component in the system to stop functioning or require a restart.
•For example, a bug in upgrade code could result in downtime due to data corruption.
•A dependent software component may become unavailable (e.g. the Java garbage collector enters a stop-the-world phase).
•A software bug in a master service will likely cause downtime.
10. Software failures
•Software failures are a significant issue in distributed systems.
•Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%).
•Residual bugs in mature systems can be classified into two main categories.
11. Heisenbug
•A bug that seems to disappear or alter its characteristics when it is observed or researched.
•A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug mode.
•The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).
12. Bohrbug
•A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched.
•A Bohrbug typically manifests itself reliably under a well-defined set of conditions.
13. Software failures
•Heisenbugs tend to be more prevalent in distributed systems than in local systems.
•One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.
14. Operator errors
•People make mistakes.
•Hadoop attempts to limit operator error by simplifying administration, validating its configuration, and providing useful messages in logs and UI components;
•however, operator mistakes may still cause downtime.
15. Strategy
•[Figure: a matrix of recovery strategies, with axes “Severity of Database Downtime” (Planned, Unplanned, Catastrophic) and “Latency of Database Recovery” (from No Downtime through High Availability and Continuous Availability to Disaster Recovery). Strategies shown include Online Maintenance, Offline Maintenance, High Availability Clusters, Switching and Warm Standby Replication, and Cold Standby.]
16. Recall
•The NameNode stores modifications to the file system as a log appended to a native file system file, edits.
•When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file.
•It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file.
•Since the NameNode merges the fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster.
•Another side effect of a larger edits file is that the next restart of the NameNode takes longer.
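The fsimage/edits scheme above can be sketched as checkpoint-plus-journal replay. This is a toy illustration in Python, not Hadoop code: the namespace is a dict, the “fsimage” is a JSON snapshot, and the “edits” log is a list of operations applied in order at startup.

```python
# Toy sketch (not Hadoop code) of checkpoint + edit-log recovery:
# state is snapshotted to an image; later mutations are appended to a log;
# startup loads the image and replays the log.

import json

class ToyNameNode:
    """Toy namespace: path -> metadata, persisted as image + edits."""

    def __init__(self):
        self.namespace = {}

    def save_image(self):
        """Checkpoint: serialize the full namespace (the 'fsimage')."""
        return json.dumps(self.namespace)

    @staticmethod
    def load(image, edits):
        """Startup: read the image, then replay every logged edit in order."""
        nn = ToyNameNode()
        nn.namespace = json.loads(image)
        for op, path, meta in edits:
            if op == "create":
                nn.namespace[path] = meta
            elif op == "delete":
                nn.namespace.pop(path, None)
        return nn

# Usage: checkpoint once, then replay edits accumulated after the checkpoint.
nn = ToyNameNode()
nn.namespace["/a"] = {"blocks": 1}
image = nn.save_image()                      # fsimage at checkpoint time
edits = [("create", "/b", {"blocks": 2}),    # appended after the checkpoint
         ("delete", "/a", None)]
restored = ToyNameNode.load(image, edits)
# restored.namespace == {"/b": {"blocks": 2}}
```

The longer `edits` grows between checkpoints, the longer `load` takes, which is exactly why a busy cluster with rare restarts sees slow NameNode startup.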
17. Availability – Attempt 1 – Secondary namenode
•Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
•The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge.
•It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
•However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain.
•The usual course of action in this case is to copy the namenode's metadata files that are on NFS to the secondary and run it as the new primary.
•The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory,
•so that the checkpointed image is always ready to be read by the primary NameNode if necessary.
18. Long Recovery
•To recover from a failed namenode, an administrator starts a new primary namenode with one of the file-system metadata replicas, and configures datanodes and clients to use this new namenode.
•The new namenode is not able to serve requests until it has
–loaded its namespace image into memory,
–replayed its edit log, and
–received enough block reports from the datanodes to leave safe mode.
•On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.
•The long recovery time is a problem for routine maintenance too.
•In fact, since unexpected failure of the namenode is so rare, the case for planned downtime is actually more important in practice.
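The last of the three startup steps, waiting for block reports, can be sketched as a simple threshold check. The numbers and function name here are illustrative, but the rule mirrors HDFS safe mode: the namenode serves requests only once a minimum fraction of blocks has been reported by datanodes.

```python
# Illustrative safe-mode rule (hypothetical names/threshold): the namenode
# stays in safe mode until enough blocks have been reported to trust the
# in-memory block map.

def can_leave_safe_mode(reported_blocks, total_blocks, threshold=0.999):
    """True once enough block reports have arrived."""
    if total_blocks == 0:
        return True
    return reported_blocks / total_blocks >= threshold

assert not can_leave_safe_mode(950_000, 1_000_000)   # 95% reported: stay in safe mode
assert can_leave_safe_mode(999_500, 1_000_000)       # 99.95% reported: may leave
```

On a cluster with millions of blocks, collecting these reports from every datanode dominates cold-start time, which is why recovery takes tens of minutes even after the edit log has been replayed.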
19. Other roads to availability
•The NameNode persists its namespace using two files:
–fsimage, which is the latest checkpoint of the namespace, and
–edits, a journal (log) of changes to the namespace since the checkpoint.
•When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata.
•The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.
•The secondary name-node acts as a mere checkpointer.
•The secondary name-node should be transformed into a standby name-node (SNN).
•Make it a warm standby.
•Provide real-time streaming of edits to the SNN so that it contains the up-to-date namespace state.
20. Availability – Attempt 2 – Backup node / Checkpoint node
•The Checkpoint node periodically creates checkpoints of the namespace.
•It downloads fsimage and edits from the active NameNode,
•merges them locally, and uploads the new image back to the active NameNode.
•The Backup node provides
–the same checkpointing functionality as the Checkpoint node,
–as well as maintaining an in-memory, up-to-date copy of the file system namespace,
•always synchronized with the active NameNode state.
•It maintains an up-to-date copy of the filesystem namespace in memory.
•Both run on a server separate from the active NameNode
–(primary and backup node),
–since their memory requirements are of the same order.
•The Backup node does not need to download fsimage and edits, since it already has an up-to-date state of the namespace in memory.
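The contrast between the two node types can be sketched as follows. This is our own simplification, not Hadoop code: a Checkpoint node must fetch the active NN's state for every checkpoint, while a Backup node already holds a synchronized in-memory copy (fed by streamed edits) and can checkpoint locally.

```python
# Contrast sketch (hypothetical names, not Hadoop APIs): checkpointing with
# and without a download step.

import json

def checkpoint_node_cycle(active_state):
    """Checkpoint node: download the active NN's state, merge, return image."""
    downloaded = dict(active_state)          # simulate the fsimage+edits download
    return json.dumps(downloaded)            # merged image, uploaded back

class BackupNode:
    def __init__(self):
        self.namespace = {}                  # kept in sync by streamed edits

    def apply_edit(self, path, meta):
        self.namespace[path] = meta          # edit streamed from the active NN

    def checkpoint(self):
        """Backup node: no download needed; serialize the local copy."""
        return json.dumps(self.namespace)

# Both produce the same checkpoint image; only the data path differs.
active = {"/a": 1}
backup = BackupNode()
backup.apply_edit("/a", 1)
assert checkpoint_node_cycle(active) == backup.checkpoint()
```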
21. Terminology
•Active NN
–the NN that is actively serving read and write operations from the clients.
•Standby NN
–this NN waits and becomes active when the Active dies or is unhealthy.
–The Backup Node as in Hadoop release 0.21 could be used to implement the Standby for the “shared-nothing” storage of the filesystem namespace.
•Cold Standby
–the Standby NN has zero state (e.g. it is started after the Active is declared dead).
•Warm Standby
–the Standby has partial state:
–it has loaded fsImage and editLogs but has not received any block reports, or
–it has loaded fsImage and rolled logs and all block reports.
•Hot Standby
–the Standby has all or most of the Active's state and can start immediately.
22. High Level Use Cases
•Planned Downtime :
–A Hadoop cluster is often shut down in order to upgrade the software or configuration.
–A Hadoop cluster of 4000 nodes takes approximately 2 hours to be restarted.
•Unplanned Downtime or Unresponsive Service.
–A failover of the Namenode service can occur due to a hardware or OS failure, a failure of the Namenode daemon, or because the Namenode daemon becomes unresponsive for a few minutes.
–While this is not as common as one may expect, such a failure can occur at unexpected times and may impact the SLAs of some critical applications.
23. Specific use case
1. Single NN configuration; no failover.
2. Active and Standby with manual failover.
a) Standby could be cold/warm/hot.
3. Active and Standby with automatic failover.
a) Both NNs started; one automatically becomes active and the other standby.
b) Active and Standby running.
c) Active fails, or is unhealthy; Standby takes over.
d) Active and Standby running; Active is shut down.
e) Active and Standby running; Standby fails. Active continues.
f) Active running, Standby down for maintenance. Active dies and cannot start. Standby is started and takes over as active.
g) Both NNs started, but only one comes up. It becomes active.
h) Active and Standby running; Active state is unknown (e.g. disconnected from heartbeat) and Standby takes over.
25. HDFS-HA
•In this implementation there is a pair of namenodes in an active-standby configuration.
•In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.
•A few architectural changes are needed to allow this to happen:
–The namenodes must use highly available shared storage to share the edit log. (In the initial implementation of HA this will require an NFS filer, but in future releases more options will be provided, such as a BookKeeper-based system built on ZooKeeper.)
–When a standby namenode comes up it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.
–Datanodes must send block reports to both namenodes, since the block mappings are stored in a namenode’s memory, and not on disk.
–Clients must be configured to handle namenode failover, which uses a mechanism that is transparent to users.
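The shared-edit-log arrangement above can be sketched as a toy model: the active appends edits to highly available shared storage, and the standby tails the same log from its last-read offset to keep its namespace synchronized. All class names and record shapes are assumptions made for illustration.

```python
class SharedEditLog:
    """Stands in for the NFS- or BookKeeper-backed shared edit log."""

    def __init__(self):
        self.entries = []

    def append(self, edit):
        self.entries.append(edit)

class StandbyNameNode:
    """Tails the shared log to stay synchronized with the active."""

    def __init__(self, log):
        self.log = log
        self.offset = 0          # how far into the log we have read
        self.namespace = set()

    def tail(self):
        """Read any new entries written by the active since last tail."""
        for op, path in self.log.entries[self.offset:]:
            if op == "create":
                self.namespace.add(path)
            elif op == "delete":
                self.namespace.discard(path)
        self.offset = len(self.log.entries)

log = SharedEditLog()
standby = StandbyNameNode(log)
log.append(("create", "/a"))
log.append(("create", "/b"))
standby.tail()                   # standby catches up to the active
log.append(("delete", "/a"))
standby.tail()                   # and keeps following new entries
```

Because the standby also receives block reports directly from datanodes, its in-memory state is complete enough to take over without replaying anything at failover time.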
27. Failover in HDFS-HA
•If the active namenode fails, then the standby can take over very quickly (in a few tens of seconds) since it has the latest state available in memory:
–both the latest edit log entries, and
–an up-to-date block mapping.
•The actual observed failover time will be longer in practice (around a minute or so), since the system needs to be conservative in deciding that the active namenode has failed.
•In the unlikely event of the standby being down when the active fails, the administrator can still start the standby from cold.
•This is no worse than the non-HA case, and from an operational point of view it’s an improvement, since the process is a standard operational procedure built into Hadoop.
•The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller.
•Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one namenode is active.
•Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures (using a simple heartbeat mechanism) and trigger a failover should a namenode fail.
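A hedged sketch of the controller's monitoring loop: it declares a failure only after several consecutive missed heartbeats, which is why the observed failover time (around a minute) is longer than the standby's catch-up time. The threshold and names here are assumptions, not Hadoop configuration.

```python
class FailoverController:
    """Toy heartbeat monitor that triggers failover conservatively."""

    def __init__(self, misses_allowed=3):
        self.misses_allowed = misses_allowed
        self.missed = 0
        self.failed_over = False

    def on_heartbeat(self, ok):
        """Record one heartbeat interval; ok=False means no response."""
        self.missed = 0 if ok else self.missed + 1
        if self.missed > self.misses_allowed and not self.failed_over:
            self.failed_over = True      # trigger standby promotion
        return self.failed_over

fc = FailoverController(misses_allowed=2)
for ok in (True, False, False):          # transient misses: no failover yet
    fc.on_heartbeat(ok)
early = fc.failed_over                   # still False: being conservative
fc.on_heartbeat(False)                   # third consecutive miss
late = fc.failed_over                    # now the failover is triggered
```

In the real system this decision would be coordinated through ZooKeeper so that only one controller can promote its namenode to active.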
28. Fencing
•It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time.
•Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results.
•In order to ensure this and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage.
•The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption—a method known as fencing.
•Fencing mechanisms:
–killing the namenode’s process,
–revoking its access to the shared storage directory (typically by using a vendor-specific NFS command), and
–disabling its network port via a remote management command.
–As a last resort, the previously active namenode can be fenced with a technique rather graphically known as STONITH, or “shoot the other node in the head”, which uses a specialized power distribution unit to forcibly power down the host machine.
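The list above is naturally an escalation: try the gentlest fencing method first and fall through to STONITH only as a last resort. A minimal sketch of that ordering, with placeholder callables standing in for the real management commands:

```python
def fence(methods):
    """Try each fencing method in order; stop at the first that works."""
    for name, method in methods:
        if method():
            return name          # old active is confirmed fenced
    raise RuntimeError("all fencing methods failed; unsafe to fail over")

# Hypothetical outcome of one failover: the ssh kill fails (host is
# unreachable), but revoking shared-storage access succeeds, so STONITH
# is never reached.
attempts = [
    ("kill-process", lambda: False),
    ("revoke-nfs",   lambda: True),
    ("stonith",      lambda: True),
]
used = fence(attempts)
```

Only once some method has succeeded is it safe for the standby to begin writing to the shared edit log.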
29. Client side
•Client failover is handled transparently by the client library.
•The simplest implementation uses client-side configuration to control failover.
•The HDFS URI uses a logical hostname which is mapped to a pair of namenode addresses (in the configuration file), and the client library tries each namenode address until the operation succeeds.
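The client-side scheme above can be sketched as follows: a logical service name maps, via configuration, to a list of namenode addresses, and the client library tries each in turn until one accepts the operation. The configuration values and the failing/serving stubs are assumptions made for illustration.

```python
# Hypothetical mapping from a logical service name to NameNode addresses,
# standing in for the client-side configuration file.
CONFIG = {"mycluster": ["nn1.example.com:8020", "nn2.example.com:8020"]}

def resolve_and_call(logical_name, operation):
    """Try each configured NameNode until the operation succeeds."""
    last_error = None
    for addr in CONFIG[logical_name]:
        try:
            return operation(addr)
        except ConnectionError as exc:   # standby or dead node: try next
            last_error = exc
    raise last_error

def op(addr):
    """Stub RPC: pretend nn1 is the failed active and nn2 serves us."""
    if addr.startswith("nn1"):
        raise ConnectionError("nn1 unreachable")
    return f"served by {addr}"

result = resolve_and_call("mycluster", op)
```

From the user's point of view the URI still names only the logical cluster, so the retry across addresses stays transparent, as the slide states.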