The document describes research on improving the high availability of HDFS (Hadoop Distributed File System) by implementing a hot standby node. The researchers modified HDFS's existing backup node functionality to evolve it into a hot standby node that can immediately take over if the primary namenode fails. This was done by replicating additional state like leases and block locations to the standby and using ZooKeeper for failure detection. Experiments showed the solution added little overhead while reducing failover time from minutes to seconds. The hot standby implementation addressed a single point of failure in HDFS and improved its ability to tolerate namenode failures.
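The failover mechanism described above can be sketched as a heartbeat watchdog: the standby promotes itself once the primary's heartbeat goes stale past a timeout. This is an illustrative Python model only; the research uses ZooKeeper ephemeral nodes for failure detection, and the class and timeout names here are invented:

```python
class FailoverMonitor:
    """Hot-standby failover sketch: the standby promotes itself when the
    primary's heartbeat goes stale past a configurable timeout.
    (Illustrative model; ZooKeeper does this with ephemeral znodes.)"""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = None
        self.active = "primary"

    def heartbeat(self, now):
        # Called whenever the primary checks in; we just record a timestamp.
        self.last_heartbeat = now

    def check(self, now):
        # Standby-side check: promote if the primary has gone silent.
        if self.last_heartbeat is not None and now - self.last_heartbeat > self.timeout_s:
            self.active = "standby"
        return self.active

mon = FailoverMonitor(timeout_s=3.0)
mon.heartbeat(now=0.0)
print(mon.check(now=1.0))   # primary: heartbeat still fresh
print(mon.check(now=5.0))   # standby: heartbeat stale, standby takes over
```

Because the standby already holds leases and block locations, promotion is immediate once the timeout fires, which is what shrinks failover from minutes to seconds.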
Hadoop Distributed File System (HDFS) is a distributed file system that stores large datasets across commodity hardware. It is highly fault tolerant, provides high throughput, and is suitable for applications with large datasets. HDFS uses a master/slave architecture where a NameNode manages the file system namespace and DataNodes store data blocks. The NameNode ensures data replication across DataNodes for reliability. HDFS is optimized for batch processing workloads where computations are moved to nodes storing data blocks.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
HDFS is a distributed file system designed for storing very large data files across commodity servers or clusters. It works on a master-slave architecture with one namenode (master) and multiple datanodes (slaves). The namenode manages the file system metadata and regulates client access, while datanodes store and retrieve block data from their local file systems. Files are divided into large blocks which are replicated across datanodes for fault tolerance. The namenode monitors datanodes and replicates blocks if their replication drops below a threshold.
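The re-replication behaviour in the last sentence can be sketched in a few lines: scan the block map for blocks whose live replica count is below the replication factor, then pick destination nodes that do not already hold a copy. A simplified illustration, not the namenode's actual placement logic:

```python
def under_replicated(block_locations, target=3):
    """Blocks whose live replica count is below the target replication
    factor. block_locations maps block id -> set of datanodes holding it."""
    return {b for b, nodes in block_locations.items() if len(nodes) < target}

def schedule_rereplication(block_locations, live_nodes, target=3):
    """For each under-replicated block, pick destination nodes that do not
    already hold a replica (placement policy simplified for illustration)."""
    plan = {}
    for b in sorted(under_replicated(block_locations, target)):
        holders = block_locations[b]
        candidates = [n for n in sorted(live_nodes) if n not in holders]
        plan[b] = candidates[: target - len(holders)]
    return plan

locations = {"blk_1": {"dn1", "dn2", "dn3"},   # healthy
             "blk_2": {"dn1"}}                 # lost two replicas
print(schedule_rereplication(locations, live_nodes={"dn1", "dn2", "dn3", "dn4"}))
# {'blk_2': ['dn2', 'dn3']}
```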
Apache Hadoop YARN, NameNode HA, HDFS Federation - Adam Kawa
The document provides an introduction to YARN, HDFS federation, and HDFS high availability. It discusses limitations of the original MapReduce framework and HDFS, such as single points of failure. It then summarizes improvements in YARN including distributed resource management and the ability to run multiple applications. HDFS federation and high availability address scalability and reliability concerns by partitioning the namespace and introducing redundant NameNodes. Configuration parameters and Apache Whirr are also covered for quickly setting up a YARN cluster.
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis - Sameer Tiwari
There is a plethora of storage solutions for big data, each having its own pros and cons. The objective of this talk is to delve deeper into specific classes of storage types like Distributed File Systems, in-memory Key Value Stores, Big Table Stores and provide insights on how to choose the right storage solution for a specific class of problems. For instance, running large analytic workloads, iterative machine learning algorithms, and real time analytics.
The talk will cover HDFS, HBase, and a brief introduction to Redis.
HDFS is a distributed file system designed for storing very large data sets reliably and efficiently across commodity hardware. It has three main components - the NameNode, Secondary NameNode, and DataNodes. The NameNode manages the file system namespace and regulates access to files. DataNodes store and retrieve blocks when requested by clients. HDFS provides reliable storage through replication of blocks across DataNodes and detects hardware failures to ensure data is not lost. It is highly scalable, fault-tolerant, and suitable for applications processing large datasets.
This document outlines the agenda for a training on Oracle RDBMS 12c new features. The training will cover 6 chapters: introduction, multitenant architecture, upgrade features, Flex Cluster, Global Data Service, and an overview of RDBMS features. The agenda provides a high-level overview of topics to be discussed in each chapter, including multitenant architecture concepts, upgrade options and tools, Flex Cluster configurations, Global Data Service components, and new features such as temporary undo and multiple indexes on the same columns.
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. In this video, you will learn what Hadoop is, the components of Hadoop, what HDFS is, the HDFS architecture, Hadoop MapReduce, a Hadoop MapReduce example, Hadoop YARN, and finally see a demo of MapReduce. Apache Hadoop offers a versatile, adaptable, and reliable distributed computing framework for big data, built on clusters of systems with local storage capacity and computing power. After watching this video, you will also understand the Hadoop Distributed File System and its features, along with a practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL: creating, transforming, and querying DataFrames
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Hadoop is an open-source software framework for distributed storage and processing of large datasets. It has three core components: HDFS for storage, MapReduce for processing, and YARN for resource management. HDFS stores data as blocks across clusters of commodity servers. MapReduce allows distributed processing of large datasets in parallel. YARN improves on MapReduce and provides a general framework for distributed applications beyond batch processing.
Difference between Hadoop 2 and Hadoop 3 - Manish Chopra
Hadoop 3.x includes improvements over Hadoop 2.x such as supporting Java 8 as the minimum version, using erasure coding for fault tolerance which reduces storage overhead, improving the YARN timeline service for better scalability and reliability, and moving default ports out of the ephemeral range to prevent startup failures. Hadoop 3.x also adds support for the Microsoft Azure Data Lake filesystem and provides better scalability by allowing clusters to scale to over 10,000 nodes. Key features for resource management, high availability, and running analytics workloads are also continued from Hadoop 2.x in Hadoop 3.x.
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...Leons Petražickis
This document provides instructions for completing a hands-on lab to explore Hadoop and big data technologies including HDFS, MapReduce, Pig, Hive, and Jaql. The lab uses a dataset from Google Books to demonstrate word counting and generating histograms of word lengths. Key steps include using Hadoop commands to interact with HDFS, running the WordCount MapReduce program, writing Pig scripts to analyze the data, and using Hive to load the data and generate results. The overall goal is to gain experience using these big data technologies on a Hadoop cluster.
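The lab's two exercises, word counting and a word-length histogram, reduce to a couple of small functions. Here is a local Python sketch of what the WordCount MapReduce job and the Pig/Hive scripts compute (no cluster required; function names are illustrative):

```python
from collections import Counter

def word_count(lines):
    # Local equivalent of the lab's WordCount job: emit (word, 1) per
    # token, then sum counts per word.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def length_histogram(counts):
    # Second lab exercise: histogram of word lengths, weighted by frequency.
    hist = Counter()
    for word, n in counts.items():
        hist[len(word)] += n
    return hist

text = ["the quick brown fox", "the lazy dog"]
counts = word_count(text)
print(counts["the"])                   # 2
print(dict(length_histogram(counts)))  # e.g. {3: 4, 5: 2, 4: 1}
```

On the cluster, the same logic is distributed: the map phase tokenizes splits in parallel and the reduce phase sums per key.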
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making it imperative to be skilled as a Hadoop admin for better career, salary, and job opportunities.
Learn how to set up a Hadoop cluster with HDFS High Availability here: www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
The document describes a distributed Hadoop architecture with multiple data centers and clusters. It shows how to configure Hadoop to access HDFS files across different name nodes and clusters using tools like ViewFileSystem. Client applications can use a single consistent file system namespace and API to access data distributed across the infrastructure.
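The single-namespace behaviour described above boils down to client-side longest-prefix resolution against a mount table, which is roughly what ViewFileSystem does. A minimal sketch, with made-up hostnames and mount points:

```python
def resolve(mount_table, path):
    """Longest-prefix match of a client path against a viewfs-style
    mount table, returning the rewritten URI on the owning namenode.
    (Simplified model of ViewFileSystem's client-side resolution.)"""
    best = max((p for p in mount_table if path.startswith(p)), key=len, default=None)
    if best is None:
        raise KeyError(f"no mount point for {path}")
    return mount_table[best].rstrip("/") + "/" + path[len(best):].lstrip("/")

# Hypothetical mount table spanning two data centers.
mounts = {
    "/user":    "hdfs://nn1.dc1.example.com/user",
    "/project": "hdfs://nn2.dc2.example.com/project",
}
print(resolve(mounts, "/user/alice/data.csv"))
# hdfs://nn1.dc1.example.com/user/alice/data.csv
```

The client sees one namespace; which name node actually serves a path is decided entirely by this table.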
The Hadoop Distributed File System (HDFS) has a master/slave architecture with a single NameNode that manages the file system namespace and regulates client access, and multiple DataNodes that store and retrieve blocks of data files. The NameNode maintains metadata and a map of blocks to files, while DataNodes store blocks and report their locations. Blocks are replicated across DataNodes for fault tolerance following a configurable replication factor. The system uses rack awareness and preferential selection of local replicas to optimize performance and bandwidth utilization.
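The rack-aware placement mentioned above follows HDFS's default policy: first replica on the writer's node, second on a node in a different rack, third on another node of that same remote rack. A sketch of that policy, assuming every rack has at least two nodes (node and rack names are illustrative):

```python
import random

def place_replicas(writer, topology, rng):
    """Sketch of HDFS's default rack-aware placement policy.
    topology maps rack -> list of nodes; assumes >= 2 nodes per rack."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer                                   # local replica
    remote_racks = [r for r in sorted(topology) if r != rack_of[writer]]
    second_rack = rng.choice(remote_racks)           # survive a rack failure
    second = rng.choice(topology[second_rack])
    third = rng.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]                    # third shares second's rack

topo = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
placement = place_replicas("dn1", topo, random.Random(0))
print(placement[0])   # dn1 (local node)
```

Keeping the third replica on the second rack trades a little failure independence for much cheaper inter-rack write traffic: one cross-rack transfer instead of two.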
Hadoop Operations - Best practices from the field - Uwe Printz
Talk about Hadoop Operations and Best Practices for building and maintaining Hadoop cluster.
Talk was held at the data2day conference in Karlsruhe, Germany on 27.11.2014
The document discusses the Apache Hadoop ecosystem and versions. It provides details on Hadoop versioning from 0.1 to the current versions of 0.22, 0.23, and 1.0. It summarizes the key features and testing of Hadoop 0.22, which has been stabilized by eBay for production use. The document recommends Hadoop 0.22 as a reliable version to use until further versions are released.
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
Hadoop consists of HDFS for storage and MapReduce for processing. HDFS provides massive storage, fault tolerance through data replication, and high-throughput access to data. It uses a master-slave architecture, with a NameNode managing the file system namespace and DataNodes storing file data blocks. The NameNode ensures data reliability through policies that replicate blocks across racks and nodes. HDFS provides scalable, flexible, and low-cost storage of large datasets.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
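The ACL evaluation behind the simplified permission control mentioned above follows POSIX semantics: the owner entry is checked first, then named users, then matching group entries, then others. A simplified Python model of that ordering (not Hadoop's actual API; the entry layout is invented for illustration):

```python
def acl_permits(acl, user, groups, perm):
    """POSIX-style ACL check in evaluation order:
    owner -> named users -> matching group entries -> other."""
    owner_name, owner_perms = acl["owner"]
    if user == owner_name:
        return perm in owner_perms
    for name, perms in acl.get("named_users", {}).items():
        if user == name:
            return perm in perms
    # If any group entry matches, access is decided by the group entries
    # alone; we do not fall through to "other".
    group_perms = [p for g, p in acl.get("groups", {}).items() if g in groups]
    if group_perms:
        return any(perm in p for p in group_perms)
    return perm in acl["other"]

acl = {
    "owner": ("alice", {"r", "w"}),
    "named_users": {"bob": {"r"}},
    "groups": {"analysts": {"r"}},
    "other": set(),
}
print(acl_permits(acl, "bob", set(), "w"))           # False: bob's entry lacks w
print(acl_permits(acl, "carol", {"analysts"}, "r"))  # True via the group entry
```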
Hadoop Operations - Best Practices from the Field - DataWorks Summit
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
Kelkoo uses a Big Data platform including Flume, HDFS, Spark on Yarn, and Hive/SparkSQL. Flume collects log data from various sources and aggregates it into HDFS for distributed storage. HDFS uses a namenode and datanodes for high availability. Spark on Yarn enables distributed processing of the data through Spark applications running executors and tasks across Yarn containers. Hive and SparkSQL allow querying and analyzing the data.
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It supports very large namespaces (over 100 million files) and is optimized for batch processing of huge datasets across large clusters (over 10,000 nodes). HDFS stores multiple replicas of data blocks on different nodes to handle failures. It provides high aggregate bandwidth and allows computations to move to where the data resides.
HDFS (Hadoop Distributed File System) is a distributed file system that stores large data sets across clusters of machines. It partitions and stores data in blocks across nodes, with multiple replicas of each block for fault tolerance. HDFS uses a master/slave architecture with a NameNode that manages metadata and DataNodes that store data blocks. The NameNode and DataNodes work together to ensure high availability and reliability even when hardware failures occur. HDFS supports large data sets through horizontal scaling and tools like HDFS Federation that allow scaling the namespace across multiple NameNodes.
DataStax | DSE: Bring Your Own Spark (with Enterprise Security) (Artem Aliev)... - DataStax
Connecting Apache Spark to C* is easy, thanks to DataStax Spark Cassandra Connector. But what about Security?
DSE brings enterprise security and Kerberos support to C*. The latest Hadoop distributions ship with Spark support and also support Kerberos, so you can now add Cassandra to your Hadoop infrastructure with integrated security and build reliable speed-layer and streaming applications by combining data from both worlds.
This presentation will show all the fun around security configurations:
1. DSE client with SSL and Kerberos
2. Connect from Hadoop Spark to DSE
3. Connect DSE Spark to HDFS sources.
4. And all of the above, even with a Windows DC :)
About the Speaker
Artem Aliev Software Developer, DataStax
Artem Aliev is a software developer on the DataStax Analytics team. He works on integrating the C* database with analytics solutions like Spark and Hive.
- The document describes installing Oracle Real Application Clusters (RAC) and Cluster Ready Services (CRS) on a two-node Windows cluster.
- It involves a two phase installation - first installing and configuring CRS, then installing the Oracle Database with RAC.
- Key steps include configuring shared disks and partitions for the Oracle Cluster Registry, voting disk, and Automatic Storage Management; installing and configuring CRS; and then installing Oracle Database with RAC.
The document discusses erasure coding as an alternative to replication in distributed storage systems like HDFS. It notes that while replication provides high durability, it has high storage overhead, and erasure coding can provide similar durability with half the storage overhead but slower recovery. The document outlines how major companies like Facebook, Windows Azure Storage, and Google use erasure coding. It then provides details on HDFS-EC, including its architecture, use of hardware acceleration, and performance evaluation showing its benefits over replication.
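The storage-overhead comparison above is simple arithmetic: 3x replication stores three raw bytes per logical byte, while a Reed-Solomon RS(6,3) policy (the scheme HDFS-EC uses by default) stores nine units for every six units of data yet still tolerates the loss of any three units:

```python
def storage_overhead(data_units, parity_units):
    # Raw bytes stored per logical byte under RS(data, parity) erasure coding.
    return (data_units + parity_units) / data_units

# 3x replication: every byte stored three times, tolerates 2 lost copies.
replication_overhead = 3.0

# RS(6,3): 6 data + 3 parity units, tolerates any 3 lost units.
ec_overhead = storage_overhead(6, 3)
print(ec_overhead)                         # 1.5
print(ec_overhead / replication_overhead)  # 0.5: half the raw storage
```

The flip side noted in the document: recovering a lost unit under erasure coding requires reading several surviving units and recomputing, which is why recovery is slower than simply copying a replica.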
Ravi Namboori Hadoop & HDFS Architecture - Ravi Namboori
HDFS Architecture: An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
The accompanying figure, presented by Cisco evangelist Ravi Namboori, illustrates this architecture.
Ceph Day London 2014 - The current state of CephFS development - Ceph Community
The document discusses recent developments in CephFS. It provides an overview of CephFS architecture including components like clients, servers, storage and data placement. The focus is on improving resilience and making CephFS production-ready with features like online filesystem checking, journal resilience tools, client management and online diagnostics. The goal is to handle failures and diagnose problems in a distributed filesystem environment.
Presentation from 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
Presentation from 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
1. beyond mission critical virtualizing big data and hadoopChiou-Nan Chen
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
Hadoop Institutes : kelly technologies is the best Hadoop Training Institutes in Hyderabad. Providing Hadoop training by real time faculty in Hyderabad.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
HBaseCon 2015: HBase at Scale in an Online and High-Demand EnvironmentHBaseCon
Pinterest runs 38 different HBase clusters in production, doing a lot of different types of work—with some doing up to 5 million operations per second. In this talk, you'll get details about how we do capacity planning, maintenance tasks such as online automated rolling compaction, configuration management, and monitoring.
Big Data in Container; Hadoop Spark in Docker and MesosHeiko Loewe
3 examples for Big Data analytics containerized:
1. The installation with Docker and Weave for small and medium,
2. Hadoop on Mesos w/ Appache Myriad
3. Spark on Mesos
A brief introduction to Hadoop distributed file system. How a file is broken into blocks, written and replicated on HDFS. How missing replicas are taken care of. How a job is launched and its status is checked. Some advantages and disadvantages of HDFS-1.x
HDFS is a distributed file system designed for large data sets and high throughput access. It uses a master/slave architecture with a Namenode managing the file system namespace and Datanodes storing file data blocks. Blocks are replicated across Datanodes for fault tolerance. The system is highly scalable, handling large clusters and files sizes ranging from gigabytes to terabytes.
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single tenant environment for multiple organisations to securely share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic’s technology delivers a trusted platform for hundreds of enterprise level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering, Juan Delard de Rigoulières and Senior Solutions Architect, Amin Abbaspour as they share key lessons from their team’s journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation on the maturity and usefulness of some Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox, Kylin: we've use them all extensively.
* Our bold approach to expose APIs directly to end users; as well as the challenges, learning and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you’re an enterprise level business or a start-up looking to scale - this case study discussion offers behind-the-scenes lessons and key tips when using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard De Rigoulieres, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
This document outlines the key tasks and responsibilities of a Hadoop administrator. It discusses five top Hadoop admin tasks: 1) cluster planning which involves sizing hardware requirements, 2) setting up a fully distributed Hadoop cluster, 3) adding or removing nodes from the cluster, 4) upgrading Hadoop versions, and 5) providing high availability to the cluster. It provides guidance on hardware sizing, installing and configuring Hadoop daemons, and demos of setting up a cluster, adding nodes, and enabling high availability using NameNode redundancy. The goal is to help administrators understand how to plan, deploy, and manage Hadoop clusters effectively.
This document provides an overview of Hadoop architecture and the Hadoop Distributed File System (HDFS). It discusses Hadoop core components like HDFS, YARN and MapReduce. It also covers HDFS architecture with the NameNode and DataNodes. Additionally, it explains Hadoop configuration files, modes of operation, commands and daemons.
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
HDFS Tiered Storage: Mounting Object Stores in HDFSDataWorks Summit
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings- but also in more traditional, on premise deployments- applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Scaling HDFS with a Strongly Consistent Relational Model for MetadataHooman Peiro Sajjad
This document proposes scaling HDFS metadata by storing it in a distributed database instead of solely on the NameNode. It discusses:
1. Storing HDFS file and block metadata in MySQL Cluster, a distributed in-memory database, to allow a stateless NameNode and improve scalability.
2. Using database transactions to provide strong consistency for metadata operations through row-level locking and read-committed isolation.
3. Ways to further optimize throughput, such as implicit subtree locking and snapshot isolation to avoid locking conflicts during reads.
With the advent of Hadoop, there comes the need for professionals skilled in Hadoop Administration making it imperative to be skilled as a Hadoop Admin for better career, salary and job opportunities.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at Core Hadoop, HDFS and YARN, and answer the emerging question whether Hadoop 3.0 will be an architectural revolution like Hadoop 2 was with YARN & Co. or will it be more of an evolution adapting to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow)?
With the advent of Hadoop, there comes the need for professionals skilled in Hadoop Administration making it imperative to be skilled as a Hadoop Admin for better career, salary and job opportunities.
VMworld 2013
Chris Greer, FedEx
Richard McDougall, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
The document discusses Hadoop infrastructure at TripAdvisor including:
1) TripAdvisor uses Hadoop across multiple clusters to analyze large amounts of data and power analytics jobs that were previously too large for a single machine.
2) They implement high availability for the Hadoop infrastructure including automatic failover of the NameNode using DRBD, Corosync and Pacemaker to replicate the NameNode across two servers.
3) Monitoring of the Hadoop clusters is done through Ganglia and Nagios to track hardware, jobs and identify issues. Regular backups of HDFS and Hive metadata are also performed for disaster recovery.
Similar to IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS (20)
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
IEEE SRDS'12: From Backup to Hot Standby: High Availability for HDFS
1. André Oriani
Islene Calciolari Garcia
Institute of Computing – University of Campinas, Brazil
FROM BACKUP TO HOT STANDBY:
HIGH AVAILABILITY FOR HDFS
2. AGENDA
• Motivation;
• Architecture of HDFS 0.21;
• Implementation of Hot Standby Node;
• Experiments and Results;
• High Availability features on HDFS 2.0.0-alpha;
• Conclusions and Future Work.
SRDS’12 - From Backup to Hot Standby: High Availability for HDFS 2
3. MOTIVATION
CLUSTER-BASED PARALLEL FILE SYSTEMS
Master-Slave Architecture:
• Metadata Server – serves clients, manages the namespace.
• Storage Servers – store the data.
Centralized system: the Metadata Server is a single point of failure (SPOF).
Hence the importance of a Hot Standby for the metadata server of HDFS:
a cold start of a 2000-node HDFS cluster with 21 PB and 150 million files takes ~45 min [8].
4. HADOOP DISTRIBUTED FILE SYSTEM
(HDFS) 0.21
[Diagram: Client, NameNode, Backup Node, and three DataNodes]
5. DATANODES
Storage Nodes
• Files are split into equal-sized blocks.
• Blocks are replicated to DataNodes.
• Send status messages to the NameNode:
• Heartbeats;
• Block-Reports;
• Block-Received.
[Diagram: DataNodes sending status messages to the NameNode]
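The block mechanics above can be sketched in a few lines of Python (an illustrative toy, not HDFS code; the block size, node names, and round-robin placement policy are invented for the example):

```python
# Illustrative toy (not HDFS code): split a file into equal-sized blocks
# and assign each block to REPLICATION distinct DataNodes.

BLOCK_SIZE = 4       # bytes; tiny on purpose (HDFS 0.21 defaulted to 64 MB)
REPLICATION = 3

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file contents into fixed-size blocks (last block may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, datanodes: list, replication: int = REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    return {b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(b"hello world!")          # 12 bytes -> 3 blocks
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
assert placement[0] == ["dn1", "dn2", "dn3"]         # 3 replicas per block
```

HDFS's real placement policy is rack-aware; round-robin here only shows the shape of the block-to-nodes mapping.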
6. NAMENODE
[Diagram: the Client sends requests to the NameNode and receives metadata; DataNodes send heartbeats, block-reports, and block-received messages and receive commands]
Metadata Server
• Manages the file system tree.
• Handles metadata requests.
• Controls access and leases.
• Manages Blocks:
• Allocation;
• Location;
• Replication levels.
7. NAMENODE’S STATE
[Diagram: NameNode state – the file system tree, block management, and leases, with a journal log]
Journaling
• All state is kept in RAM for better performance.
• Changes to the namespace are recorded to the log.
• Lease and block information is volatile.
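The journaling idea can be illustrated with a small sketch (hypothetical Python, not the actual Java NameNode; class and operation names are invented): namespace mutations are appended to a log whose replay reconstructs the in-memory state, while lease and block information never reaches the log.

```python
# Minimal sketch of the journaling idea: namespace state lives in memory,
# every mutation is appended to a log, and replaying the log rebuilds the
# in-memory state after a restart (this is also what the Backup Node does).

class Namespace:
    def __init__(self):
        self.tree = {}          # path -> file metadata (kept in RAM)
        self.log = []           # append-only journal of namespace changes

    def create(self, path):
        self.log.append(("create", path))   # journal first ...
        self.tree[path] = {"blocks": []}    # ... then apply in memory

    def delete(self, path):
        self.log.append(("delete", path))
        self.tree.pop(path, None)

    @staticmethod
    def replay(log):
        """Rebuild the namespace from the journal alone."""
        ns = Namespace()
        for op, path in log:
            if op == "create":
                ns.tree[path] = {"blocks": []}
            else:
                ns.tree.pop(path, None)
        return ns

ns = Namespace()
ns.create("/a"); ns.create("/b"); ns.delete("/a")
recovered = Namespace.replay(ns.log)
assert recovered.tree == ns.tree        # replay reproduces the live state
```

Note what is missing: leases and block locations are never journaled here, which is exactly the volatile state the Hot Standby must obtain by other means.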
8. BACKUP NODE
[Diagram: the NameNode streams journal entries to the Backup Node, which produces checkpoints]
Checkpoint Helper
• The NameNode streams changes to the Backup Node.
• Efficient checkpoint strategy: the Backup Node applies the changes to its own state.
• Checkpointing the Backup's state == checkpointing the NameNode's state.
9. A HOT STANDBY FOR HDFS 0.21
• Backup Node: an opportunity
• Already replicates the namespace state.
• A subclass of NameNode, so it can process client requests.
• Evolving the Backup Node into Hot Standby Node:
1. Handle the missing state components:
   a. Replica locations;
   b. Leases.
2. Detect NameNode’s Failure.
3. Switch the Hot Standby Node to active state (failover).
4. Disseminate current metadata server information.
10. MISSING STATE:
REPLICA LOCATIONS
Reuses the AvatarNodes' strategy:
• Sends the DataNodes' status messages to the Hot Standby too.
• No rigid sync: DataNode failures are relatively common (stable clusters: 2-3 failures per day in 1,000 nodes [3]).
• The Hot Standby Node is kept in safe mode; it becomes the authority once it is active.
[Diagram: a DataNode sends heartbeats, block-reports, and block-received messages to both the NameNode and the Hot Standby Node]
11. MISSING STATE:
LEASES
Not replicated:
• Blocks are only recorded to the log when the file is closed.
• So any write in progress is lost if the NameNode fails.
• Restarting the write will create a new lease.
[Diagram: the client issues open(file), several getAdditionalBlock() calls, and close(file); the NameNode tracks addLease(file, client) in memory and only records complete(file, blocks) to the log at close]
12. FAILURE DETECTION
Uses ZooKeeper:
• A highly available distributed coordination service.
• Keeps a tree of znodes replicated among the ensemble.
• All operations are atomic.
• Leaf znodes may be ephemeral.
• Clients can watch znodes.
[Diagram: the NameNode and the Hot Standby Node register under the "namenodes" znode in the ZooKeeper ensemble; clients look up the active node's IP there]
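The detection pattern can be sketched as follows (a toy in-process simulation of the ephemeral-znode-plus-watch idiom; a real deployment would use a ZooKeeper client, and the class and path names here are invented):

```python
# Sketch of ZooKeeper-style failure detection: the active NameNode holds
# an ephemeral znode tied to its session; when the session dies, the znode
# vanishes and watchers (the Hot Standby) are notified to start failover.

class MiniCoordinator:
    """Toy stand-in for a ZooKeeper ensemble: ephemeral nodes + watches."""
    def __init__(self):
        self.znodes = {}        # path -> (value, owner_session)
        self.watches = {}       # path -> list of callbacks

    def create_ephemeral(self, path, value, session):
        self.znodes[path] = (value, session)

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def session_expired(self, session):
        # Ephemeral znodes owned by the dead session disappear; every
        # watcher of those paths is notified exactly once.
        for path in [p for p, (_, s) in self.znodes.items() if s == session]:
            del self.znodes[path]
            for cb in self.watches.pop(path, []):
                cb(path)

zk = MiniCoordinator()
events = []
zk.create_ephemeral("/namenodes/active", "10.0.0.1:8020", session="nn-session")
zk.watch("/namenodes/active", lambda path: events.append("failover!"))
zk.session_expired("nn-session")     # NameNode crashes / loses its session
print(events)                        # -> ['failover!']
```

In the real system the notification is bounded by the ZooKeeper session timeout, which is why that timeout dominates the failover times reported later.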
13. FAILOVER
The NameNode fails.
[Diagram: after the NameNode failure, the "namenodes" znode in the ZooKeeper ensemble points to the Hot Standby]
14. FAILOVER (CONT.)
Switching the Hot Standby Node to active:
1. Stop checkpointing;
2. Close all open files;
3. Restart lease management;
4. Leave safe mode;
5. Update the group znode to its network address.
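The five activation steps can be sketched as a single procedure (illustrative Python with invented field names; the real Hot Standby Node is Java inside HDFS):

```python
# Illustrative sketch of the failover steps above (invented field names;
# not the actual implementation).

def failover_to_active(standby: dict) -> list:
    """Run the five activation steps in order and report what was done."""
    actions = []
    standby["checkpointing"] = False              # 1. stop checkpointing
    actions.append("stop checkpointing")
    standby["open_files"].clear()                 # 2. close all open files
    actions.append("close open files")
    standby["lease_manager"] = "running"          # 3. restart lease management
    actions.append("restart lease management")
    standby["safe_mode"] = False                  # 4. leave safe mode
    actions.append("leave safe mode")
    standby["active_znode"] = standby["address"]  # 5. publish own address
    actions.append("update group znode")
    return actions

node = {"checkpointing": True, "open_files": {"/tmp/x"}, "lease_manager": None,
        "safe_mode": True, "active_znode": "old-nn:8020", "address": "standby:8020"}
steps = failover_to_active(node)
assert node["safe_mode"] is False and node["active_znode"] == "standby:8020"
```

The ordering matters: safe mode is left only after leases are restarted, and the znode update comes last so clients never see the new address before the node can serve.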
15. EXPERIMENTS
• Test environment: Amazon EC2
• 1 Zookeeper server;
• 1 NameNode;
• 1 Backup Node or Hot Standby Node;
• 20 DataNodes;
• 20 Clients running the test programs;
• All small instances.
• Performance tests - Comparison against HDFS 0.21.
• Failover tests - the NameNode is shut down when the block count exceeds 2K.
• Two test scenarios
• Metadata : Each client creates 200 files of one block each;
• I/O : Each client creates a single file of 200 blocks;
• 5 samples per scenario.
• Source code, test scripts and raw data:
• Available at https://sites.google.com/site/hadoopfs/experiments
16. RESULTS OVERVIEW
Lines of code: 1373 (0.18% of the original HDFS 0.21 code base).
Performance:
• NameNode
• Failover Manager overhead: an increase of 16% in CPU time and 12% in heap memory compared to HDFS 0.21.
• DataNodes
• No considerable change in network traffic: extra messages are less than 0.43% of the total flow out of a DataNode.
• Substantial overhead only in the I/O scenario: 17% in CPU and 6% in heap memory compared to HDFS 0.21 in the same scenario.
17. RESULTS OVERVIEW (CONT.)
• Big block-received message problem:
• The Hot Standby only learns (through the log) the blocks of a file when it is closed.
• The Hot Standby returns non-recognized blocks, so DataNodes can retry them in the next block-received message.
• In the I/O scenario files have 200 blocks, so many pending blocks are retried until files are closed, producing larger block-received messages and responses (more processing and memory).
• Hot Standby Node:
• CPU time 3.3 times and heap memory 1.9 times higher in the I/O scenario compared to the metadata scenario.
[Diagram: the DataNode sends block-received (new + old blocks); the Hot Standby Node responds with the non-recognized blocks]
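The retry behaviour can be sketched as follows (illustrative Python; the function and block names are invented):

```python
# Sketch of the "big block-received message" behaviour: the Hot Standby
# acknowledges only blocks it already knows from the log and returns the
# rest, which the DataNode keeps resending in its next block-received
# message until the file is closed and its blocks reach the log.

def process_block_received(known_blocks: set, reported_blocks: list) -> list:
    """Return the blocks the standby does not recognize; the sender retries them."""
    return [b for b in reported_blocks if b not in known_blocks]

pending = ["blk_1", "blk_2", "blk_3"]   # DataNode's outstanding reports
known = set()                           # file still open: log has no blocks yet
pending = process_block_received(known, pending)
assert pending == ["blk_1", "blk_2", "blk_3"]   # everything bounced back

known = {"blk_1", "blk_2", "blk_3"}     # file closed: blocks reached the log
pending = process_block_received(known, pending)
assert pending == []                    # nothing left to retry
```

With 200-block files, the pending list grows until close, which is exactly the extra CPU and memory cost measured in the I/O scenario.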
18. PERFORMANCE RESULTS
THROUGHPUT AT CLIENTS
[Charts: client throughput (MB/s), HDFS 0.21 vs. Hot Standby.]
• Metadata scenario: Write 6.24 (HDFS 0.21) vs. 5.45 (Hot Standby); Read 25.80 vs. 25.32.
• I/O scenario: Write 4.72 vs. 4.86; Read 13.89 vs. 12.39.
19. FAILOVER RESULTS
• ZooKeeper session timeout: 2 min.
• Failover time, measured from the NameNode failure until the Hot Standby processes its first request:
• Metadata scenario: (1.62 ± 0.23) min; the Hot Standby's transition accounts for 0.24% of that time.
• I/O scenario: (2.31 ± 0.46) min; the transition accounts for 22%.
20. HDFS 2.0.0-ALPHA
Released in May 2012.
[Diagram: the NameNode writes the transactional log to shared storage; the Standby NameNode reads it; clients and DataNodes communicate with both nodes.]
• DataNodes send messages to both nodes.
• The transactional log is written to highly available shared storage.
• The standby keeps reading the log from the storage.
• Blocks are logged as they are allocated.
• Manual failover with I/O fencing.
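The shared-storage scheme on this slide can be illustrated with a minimal sketch, again not Hadoop code: the active node appends edit records to a shared log file, and the standby tails the log from its last offset to update its own namespace. All names (`ActiveNameNode`, `StandbyNameNode`, `tail`) and the JSON record format are assumptions for the example.

```python
import json
import os
import tempfile


class ActiveNameNode:
    def __init__(self, log_path):
        self.log_path = log_path

    def apply(self, op):
        # Every namespace change is appended to the shared edit log.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(op) + "\n")


class StandbyNameNode:
    def __init__(self, log_path):
        self.log_path = log_path
        self.offset = 0      # position of the last edit already applied
        self.namespace = {}  # path -> list of blocks

    def tail(self):
        # Read only the edits appended since the last tail, then
        # remember where we stopped.
        with open(self.log_path) as log:
            log.seek(self.offset)
            for line in log:
                op = json.loads(line)
                if op["type"] == "create":
                    self.namespace[op["path"]] = op["blocks"]
            self.offset = log.tell()


shared = os.path.join(tempfile.mkdtemp(), "edits.log")
open(shared, "w").close()
active = ActiveNameNode(shared)
standby = StandbyNameNode(shared)
active.apply({"type": "create", "path": "/a", "blocks": ["blk_1"]})
standby.tail()
print(standby.namespace)  # the standby has caught up with the active
```

In real HDFS 2.0.0-alpha the shared storage itself must be highly available, which is exactly the trade-off noted on the slide: the availability problem moves from the file system to an external component.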
21. CONCLUSIONS AND
FUTURE WORK
• We built a high availability solution for HDFS that is capable of
delivering good throughput with low overhead to existing
components.
• The solution reacts to failures in reasonable time and
works well in elastic computing environments.
• The impact on the code base was small, and no components
external to the Hadoop Project were required.
22. CONCLUSIONS AND
FUTURE WORK (CONT.)
• Our results showed that performance can be improved by handling
new blocks better: if the Hot Standby becomes aware of which blocks
compose a file before the file is closed, writes could continue
after a failover. We also plan to support reconfiguration.
• High availability for HDFS is still an open problem:
• Multiple-failure support;
• Integration with BookKeeper;
• Using HDFS itself to store the logs.
23. ACKNOWLEDGMENTS
• Rodrigo Schmidt
• Alumnus of University of Campinas (Unicamp);
• Facebook Engineer.
• Motorola Mobility
31. PERFORMANCE RESULTS
NAMENODE RPC DATAFLOW
[Charts: NameNode RPC data flow (MB), received and sent, for the metadata and I/O scenarios, HDFS 0.21 vs. Hot Standby.]
32. FINDING THE ACTIVE SERVER
• The group znode holds the IP address of the active metadata server.
• Clients query ZooKeeper and register to be notified of changes.
[Diagram: the client queries ZooKeeper for the active's IP and sets a watcher; ZooKeeper notifies it of changes.]
Editor's Notes
Good morning, everyone. I am André Oriani from the University of Campinas, Brazil. I am going to present our work, From Backup to Hot Standby: High Availability for HDFS.
Here is today's agenda: I will talk briefly about cluster-based parallel file systems, then the architecture of HDFS 0.21 and our implementation of a Hot Standby, give an overview of our experiments and results, and finish with the high-availability features introduced in HDFS 2.0.0-alpha.
Cluster-based parallel file systems generally adopt a master-slave architecture. The master, the metadata server, manages the namespace, while the slaves, the storage nodes, store the data. Although this architecture is simple to implement and maintain, as in any centralized system the metadata server is a single point of failure. One widely used specimen of such file systems is the Hadoop Distributed File System. And why is a hot standby important for HDFS? A cold start of a big cluster such as Facebook's can take about 45 minutes, which makes it non-viable for 24/7 applications.
This slide shows the architecture and the data flows of HDFS 0.21. I will describe each node.
As in any parallel file system, files are split into equal-sized blocks that are stored on the DataNodes, the storage nodes. In particular, each block is replicated to three DataNodes for reliability. DataNodes constantly communicate their status to the metadata server, the NameNode, through messages. Heartbeats not only indicate that a DataNode is still alive, but also report its load and free space. Block reports list the healthy blocks the DataNode can offer. And when a DataNode receives a set of new blocks, it sends a block-received message. From those messages the NameNode builds its global view of the cluster, knowing where each block replica is.
As I said, the NameNode is the metadata server, and thus it is responsible for serving clients, handling metadata requests, and controlling access to files. HDFS offers POSIX-like permissions and leases; a client can only write to a file if it holds a lease for it. The NameNode also manages block allocation, block locations, and the replication status. To accomplish that work, the NameNode can send commands to DataNodes in its responses to heartbeat messages.
The NameNode keeps all its state in main memory for performance reasons. To gain some resilience, it employs journaling: every change to the file system tree is recorded in a log file. Information about leases and blocks is not recorded in the log because it is ephemeral data, so it is lost if the NameNode fails.
If no action were taken, the NameNode would end up with a big transactional log, which would seriously impact its startup time. So it counts on the Backup Node, its checkpointing helper, to compact the log into a serialized version of the file system tree. The Backup Node employs an efficient checkpoint strategy: the NameNode streams all changes to it so it can apply them to its own state, and to generate a checkpoint for the NameNode it only needs to checkpoint itself.
We found the Backup Node to be a great opportunity for implementing a hot standby for the NameNode. It already does some state replication, and because it is a subclass of the NameNode it can potentially handle client requests. In fact, turning into a hot standby was a long-term goal for the Backup Node when it was created. To evolve the Backup Node into a Hot Standby, we had to handle the missing NameNode state components, create an automatic failover mechanism, and provide a means to disseminate which metadata server is currently active.
To replicate the information about blocks, we reused a technique developed in another high-availability solution, Facebook's AvatarNode. The technique consists of modifying the DataNodes to also send their status messages to the Hot Standby. The Hot Standby Node is kept in safe mode so it does not issue commands to DataNodes that would conflict with the NameNode's. There is no rigid synchronization between the duplicated messages, because DataNodes fail at considerable rates and both nodes are built to handle that. Once it becomes active, the Hot Standby becomes the file system authority, so only its view matters.
Regarding leases, we decided not to replicate them. The reason is that the blocks that compose a file are only recorded in the transactional log when the file is closed. So if the NameNode fails while a file is being written, all the blocks of that file are lost, and the client will need to restart the write, requiring a new lease. The previous lease is therefore never going to be used and does not need to be replicated. This behavior is tolerable for applications: MapReduce will retry any failed job, and HBase only commits a transaction when the file is flushed and synced.
To detect NameNode failures we use ZooKeeper, a Hadoop subproject that provides a highly available distributed coordination service. It keeps a tree of znodes replicated among the servers of the ensemble. One interesting feature of ZooKeeper is that znodes can be ephemeral: if the session of the client that created one expires, the znode is removed, so you can implement liveness detection using this principle, and you can also register to be notified of such events. Both the NameNode and the Hot Standby create an ephemeral znode for themselves under a znode that represents the group, and the NameNode writes its network address to the group znode.
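The ephemeral-znode mechanism just described can be modeled with a tiny in-process toy, with no real ZooKeeper involved. The class and paths below (`ToyZooKeeper`, `/group/namenode`) are hypothetical stand-ins; the point is only to show how session expiry removes an ephemeral znode and fires a watch, which is what triggers the Hot Standby's failover.

```python
class ToyZooKeeper:
    """In-process stand-in for ZooKeeper's ephemeral znodes and watches."""

    def __init__(self):
        self.znodes = {}   # path -> (data, owner_session or None)
        self.watches = {}  # path -> list of callbacks

    def create(self, path, data, session=None):
        # A non-None session marks the znode as ephemeral.
        self.znodes[path] = (data, session)

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def expire_session(self, session):
        # Ephemeral znodes of the expired session are removed and their
        # watchers fired, like ZooKeeper's NodeDeleted event.
        for path in [p for p, (_, s) in self.znodes.items() if s == session]:
            del self.znodes[path]
            for cb in self.watches.pop(path, []):
                cb(path)


zk = ToyZooKeeper()
events = []
# The group znode holds the active server's address; both metadata
# servers register ephemeral znodes underneath it.
zk.create("/group", b"nn-host:8020")
zk.create("/group/namenode", b"", session="nn")
zk.create("/group/standby", b"", session="hsn")
# The Hot Standby watches the NameNode's ephemeral znode.
zk.watch("/group/namenode", lambda path: events.append("failover!"))
zk.expire_session("nn")  # NameNode crashes; its session times out
print(events)  # ['failover!']
```

A real deployment would use the ZooKeeper client's `create` with the ephemeral flag and a watch on the peer's znode, but the control flow is the same: no heartbeating logic lives in the file system itself.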
When the NameNode fails, its session with ZooKeeper eventually expires and its znode is removed. The Hot Standby is notified of that and starts the failover procedure.
The Hot Standby then stops checkpointing, closes all open files, restarts lease management, and leaves safe mode so it can control the DataNodes. Finally, it writes its network address to the group znode.
We ran experiments to determine the overhead our solution implies over HDFS 0.21 and the total failover time. We ran both tests in two scenarios: one oriented towards metadata operations and another towards I/O operations. For each scenario we ran the tests five times. The tests were executed on Amazon EC2 using 43 small instances. The source code, test scripts, and raw data are available at the address shown on the slide.
As time is short, I will give an overview of the results. The complete implementation took fewer than fourteen hundred lines, so it is easy for others to understand and maintain. Regarding performance overhead, the NameNode should not be impacted since it was not changed by the solution, but its process also hosts the Failover Manager, so we saw an increase of 16% in CPU time and 12% in heap memory compared to HDFS 0.21. For the DataNodes there was no considerable change in network traffic: the extra messages sent to the Hot Standby got diluted in the I/O flow created by clients reading and writing files. We only observed substantial overhead in the I/O scenario, an increase of 17% in CPU time and 6% in heap compared to an HDFS 0.21 DataNode in the same scenario. That is caused by growth in the block-received messages. Remember that the Hot Standby only becomes aware of the blocks that compose a file when the file is closed. If the Hot Standby receives a block-received message mentioning a block it does not know about, it returns that block, so the DataNode retries it in the next block-received message. The trouble is that in the I/O scenario the files are 200 blocks long, and the Hot Standby will only recognize any block of a file after all 200 blocks are written. So the block-received messages become long, taking a lot of processing and memory from the DataNodes and the Hot Standby. And because there is only one Hot Standby, it is the most affected node.
Despite those problems, the data throughput is still good. We consider data throughput the most important metric because it measures how much work can be done on behalf of clients. On average, the data throughput was never more than 2 MB/s below the throughput achieved by HDFS 0.21, in both scenarios, for both reads and writes.
Regarding failover time: we used a session timeout of 2 minutes because it was a safe value to avoid false positives in the virtualized environment of Amazon EC2. We measure failover from the time the NameNode fails until the Hot Standby processes its first request. In both scenarios the failover took less than 3 minutes. The time from the start of the Hot Standby's transition until the first request, which is the part our implementation can influence, took only 0.24% of the total failover in the metadata scenario. In the I/O scenario, however, that jumps to 22% because of the problem with block-received messages I just mentioned: the Hot Standby is so busy processing blocks that the transition takes longer. Once the transition finishes, things get worse, because the Hot Standby not only has to process the block-received messages but also has to instruct DataNodes to remove the blocks of all in-progress writes, so the first request is delayed for a long time, although clients could react almost instantaneously.
HDFS 2.0.0-alpha, released in May of this year, introduced some high-availability features. It also uses the technique of modifying the DataNodes to send their messages to the standby as well. But instead of streaming the changes, the active metadata server keeps the log in shared storage, and the standby keeps reading the log from that storage to update itself. The high-availability issue is thus transferred from the file system to the shared storage, which is an external component and needs to be highly available itself. It logs blocks as they are allocated, avoiding the problem we had. Currently it only supports manual failover, so it is targeted at maintenance and upgrades, but automatic failover is very likely to be in the next release. They employ I/O fencing mechanisms to prevent both NameNodes from writing to the shared storage.
How can the client determine which is the current metadata server? Remember that the NameNode writes its address to the group znode when it starts, and the Hot Standby writes its address when it finishes the failover, so the group znode always keeps the address of the active metadata server up to date. Clients just need to query ZooKeeper for that znode and register to be notified of changes.
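The client side of this scheme can be sketched as follows. This is a hedged illustration, not HDFS client code: `query` stands in for reading the group znode and `register_watch` for setting a ZooKeeper watch, so the client caches the active server's address and refreshes it when notified.

```python
class Client:
    """Caches the active metadata server's address from the group znode."""

    def __init__(self, query, register_watch):
        self.query = query              # reads the group znode's data
        register_watch(self.on_change)  # ask to be notified of updates
        self.active = query()           # initial lookup

    def on_change(self):
        # A notification arrived: the active server changed, re-query.
        self.active = self.query()


# Tiny stand-ins for the group znode and ZooKeeper's notifications.
group = {"addr": "namenode:8020"}
watchers = []
client = Client(lambda: group["addr"], watchers.append)
print(client.active)             # the NameNode's address

group["addr"] = "standby:8020"   # the Hot Standby finished its failover
for w in watchers:
    w()                          # ZooKeeper notifies registered clients
print(client.active)             # the client now talks to the standby
```

In real ZooKeeper a watch fires only once and must be re-registered on each query, a detail this sketch glosses over for brevity.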