SlideShare a Scribd company logo
1 of 21
Download to read offline
HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Namenode High Availability
Uma Maheswara Rao G
maheswara@huawei.com
umamahesh@apache.org
HUAWEI TECHNOLOGIES CO., LTD. Page 2
Who am I
R&D Tech Lead in Huawei
Apache Hadoop Committer at ASF
Active Contributor to Apache BookKeeper at ASF
Main development focus on HDFS
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 3
Huawei Hadoop R&D – In a Glance
• Secondary Index in HBase
• HDFS NN HA (Hadoop-2)
• Bookkeeper as shared storage for NN HA (Hadoop-2)
• HDFS NN HA (Hadoop-1)
• MapReduce ResourceManager HA (Hadoop-2 / YARN)
• MapReduce JobTracker HA (Hadoop-1)
• Hive HA
Hadoop Development
• Raised over 650 defects since Jan’11
• Fixed over 500 defects since Jan’11, and contributed back
to community.
Stabilization
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 4
In 2011, we implemented HA in Hadoop 0.20.1, based on
Backup Namenode(BNN) and ZooKeeper at Huawei
 Intelligent clients - find active NN from configured NNs
 Streaming edits to the BNN
 Sending block reports to both active NN and BNN
 BNN does periodic checkpoints
 ZK based leader election
Achieved hot standby and Automatic Switch
However, BNN based solution did not address double failure
scenario
Namenode High Availability
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 5
What is Double Failure here?
Active NN Backup NN
Active NN Start - Active NN
Checkpoint – t1
Checkpoint – t2
NN-1 NN-2
NN-1 NN-2
X - Operation logs in NN1 and could not stream to
BNN (NN2)
Not aware of X - Operation logs
from NN1. DataLoss!!!
 NN1 is active and NN2 is Backup node
 NN1 failed to stream edits to the Backup node and removed the stream
 NN1 received X- operation edit logs, where NN2 not aware of them
 Now NN1 crashed, NN2 becomes active
 NN2 continues to work but not aware of the edits after checkpoint:t1
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 6
In 2011 - 2012, Community started discussion on Shared storage
approach
 HDFS-1623 : Discussed double failure issue
 Store the edit logs at a common place and share to both the
Namenodes
Huawei collaborated with community in HDFS-1623
development
Community discussions on double failure
Constraints with the HDFS-1623:
The shared storage is an SPOF. A highly available shared storage is
required.
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 7Page 7
NameNode HA Architecture
Both NameNodes start as
standby and try loading the edits
Namenode
Standby
Failover
Controller
Namenode
Active
Failover
Controller
Shared storage
DN 1 DN 2 DN n
ZK 1 ZK 2 ZK n
Cmd
Monitor
NN
health
Namenode
Standby
Write
Block reports
Read Read
Monitor
NN
health
New process FailOver
Controller (ZKFC) responsible
for monitoring and failover
On ZK based leader election
ZKFC issue command to its local
NN to become Active or
Standby
Only Active NN will write to
shared storage, Standby NN will
read from it.
All DataNodes will send block
reports to both Namenodes to
achieve hot Standby.
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 8
Shared storage options
NFS: May not be a better fit for many deployments as NFS is an
external device, costly, less control on timeouts etc.
QuorumJournalManager(QJM): Newly implemented in HDFS. It did
not come out from any of the Apache releases yet.
BoookKeeperJournalManager(BKJM): Integrated BookKeeper based
shared storage to HDFS. Available in Hadoop-2 onwards.
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 9
 BookKeeper is mainly inspired by Namenode. Can work as shared storage with HA
Namenode
 Can Support Federated Namenodes
 Dynamic scaling up/down of storage nodes(BK)
– Useful when HBase, Federated NN also uses it.
 Multiple disk support - Reduces the disk seeks
 Round Robin reads from quorum - distributes the load
 HDFS-234 - BookKeeper based Journal manager
- Available in Hadoop-2 releases and trunk
 Quorum based commits in trunk
– Parallel writes to write-quorum Bookies and wait for ack-quorum (see
BOOKKEEPER-208)
NameNode HA
Motivations to use BookKeeper as shared storage
Contd..
HUAWEI TECHNOLOGIES CO., LTD. Page 10
Motivations to use BookKeeper as shared storage
 BookKeeper is in production use at Yahoo! for guaranteed delivery of log
messages to Hedwig Servers
 Code was there for more than a year in Apache and has active
contributors and committers list
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 11
BookKeeper at glance
Bookie: Storage node
Ledger: log file
Ensemble: group of bookies
storing a ledger
Writes to quorums of Bookies
Parallel writes to quorums
Wait for ack quorum (in BK 4.2)
Reads from the same quorum
NameNode HA
Throughput: max latency of ack quorum bookies response
HUAWEI TECHNOLOGIES CO., LTD. Page 12Page 12Page 12
NN HA Architecture with BK as shared storage
NameNode HA
Failover
Controller
Failover
Controller
Cmd
Monitor
NN
health
Write
Block reports
Read
Monitor
NN
health
BK 1 BK 2 BK n
Namenode
Active
B
K
J
M
Namenode
Standby
B
K
J
M
BookKeeper Journal Manager
(BKJM) is NameNode plugin
implementation, involves BK
client to read/write to/from BK
cluster
ZK 1 ZK 2 ZK n
DN 1 DN 2 DN n
HUAWEI TECHNOLOGIES CO., LTD. Page 13Page 13
NameNode HA
Depends on ZooKeeper for leader
election
Monitors the health of Namenodes
Invoke the RPC calls to NN to switch
the state when leader election
happens
Perform leader election when health
monitor find NN is bad
Fencing: ZKFC will kill the remote
NN before invoking switch
commands to ensure single Active
NN at any time.
ZooKeeper Failover Controller (ZKFC)
Namenode
Monitor
health
Transiti
on to
Active/
Standby
Namenode
Monitor
health
Transiti
on to
Active/
Standby
Fencing
Failover
Controlle
r
Failover
Controller
ZK
1
ZK
2
ZK
n
HUAWEI TECHNOLOGIES CO., LTD. Page 14
What if remote node is not accessible to kill?
• Requires to write a fence method with shared storage.
Note: Shared storage should support fencing for this purpose
How BK based shared storage handles fencing?
• BookKeeper allows single writer for a ledger
“Eliminates the need of strict ZKFC level fencing”
• BKJM solves the issue of concurrent ledger creations by two Active
Namenodes - see HDFS-3452
Special fencing scenarios:
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 15
Configurations
 Namenode side:
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/haedit</value>
</property>
<property>
<name>dfs.namenode.edits.journal-plugin.bookkeeper</name>
<value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value>
</property>
 BookKeeper side:
zkServers = zk1:2181,zk2:2181,zk3:2181
Bookies can start with remaining default configurations
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 16
NameNode HA
Other Enhancements/Work In Progress
BOOKKEEPER-237 - Automatic recovery on Bookie failures.
Importance – To avoid the data-loss when multiple Bookie crashes
Entry read latency improvement with (round-robin + speculative) reads with less
timeout - Avoid latencies in switch
Already developed support for shared storage upgrades
Further improvements and work progress is in HDFS-3399
Take a look at BookKeeper community work
https://issues.apache.org/jira/browse/BOOKKEEPER
HUAWEI TECHNOLOGIES CO., LTD.
BookKeeper Auto Recovery (BOOKKEEPER-237)
 Autorecovery is a new module in BookKeeper
 Re-replicates the ledgers automatically when there are Bookie failures
Watch on Available Bookies
Detects the failed Bookie
ledgers
Add that ledgers ids in ZK
Auditor:
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD.
Monitor for the ledger ids
published by Auditor
Scan and find the missed
fragments from the ledger
Initiates the re-replication
Clean up the re-replicated ledger
ids from ZK
ReplicationWorker:
NameNode HA
BookKeeper Auto Recovery (BOOKKEEPER-237)
HUAWEI TECHNOLOGIES CO., LTD.
Testing
Tested using HA cluster with HBase, MR, Hive, Huawei OM
500+ automated integration HA tests running in daily CI
Verified the fault tolerant tests with BK recovery feature – see the
list in HDFS-3399
Periodic kill node scenarios to verify the switch with BK – Did not
see any data-loss
In our test cluster, we did not see significant degradations and
expected linear degradations when we increase quorum
NameNode HA
HUAWEI TECHNOLOGIES CO., LTD. Page 20
Any queries?
NameNode HA
Thank you
www.huawei.com

More Related Content

What's hot

[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith SharmaNewton Alex
 
HBase replication
HBase replicationHBase replication
HBase replicationwchevreuil
 
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
[Hadoop Meetup] Yarn at Microsoft - The challenges of scaleNewton Alex
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureDataWorks Summit
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsCloudera, Inc.
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
HBaseConEast2016: Practical Kerberos with Apache HBase
HBaseConEast2016: Practical Kerberos with Apache HBaseHBaseConEast2016: Practical Kerberos with Apache HBase
HBaseConEast2016: Practical Kerberos with Apache HBaseMichael Stack
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaKamesh Pemmaraju
 
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil GovindanNewton Alex
 
Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Manish Chopra
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicasenissoz
 
Hug syncsort etl hadoop big data
Hug syncsort etl hadoop big dataHug syncsort etl hadoop big data
Hug syncsort etl hadoop big dataStéphane Heckel
 
Chef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scaleChef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scaleBiju Nair
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopMike Pittaro
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 

What's hot (20)

[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
HBase replication
HBase replicationHBase replication
HBase replication
 
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
HBaseConEast2016: Practical Kerberos with Apache HBase
HBaseConEast2016: Practical Kerberos with Apache HBaseHBaseConEast2016: Practical Kerberos with Apache HBase
HBaseConEast2016: Practical Kerberos with Apache HBase
 
OpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of AlabamaOpenStack and Ceph case study at the University of Alabama
OpenStack and Ceph case study at the University of Alabama
 
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond Panel
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
 
Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3Difference between hadoop 2 vs hadoop 3
Difference between hadoop 2 vs hadoop 3
 
Cross-DC Fault-Tolerant ViewFileSystem @ Twitter
Cross-DC Fault-Tolerant ViewFileSystem @ TwitterCross-DC Fault-Tolerant ViewFileSystem @ Twitter
Cross-DC Fault-Tolerant ViewFileSystem @ Twitter
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
 
Hug syncsort etl hadoop big data
Hug syncsort etl hadoop big dataHug syncsort etl hadoop big data
Hug syncsort etl hadoop big data
 
Ceph on arm64 upload
Ceph on arm64   uploadCeph on arm64   upload
Ceph on arm64 upload
 
Chef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scaleChef conf-2015-chef-patterns-at-bloomberg-scale
Chef conf-2015-chef-patterns-at-bloomberg-scale
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 

Similar to Apache HDFS High Availability

State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDatainside-BigData.com
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterEdureka!
 
HA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and KeepalivedHA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and KeepalivedGanapathi Kandaswamy
 
HDFS presented by VIJAY
HDFS presented by VIJAYHDFS presented by VIJAY
HDFS presented by VIJAYthevijayps
 
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...Edureka!
 
Scaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesScaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesRobert Lemke
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop AdministrationEdureka!
 
The Big Cloud native FaaS Lebowski
The Big Cloud native FaaS Lebowski The Big Cloud native FaaS Lebowski
The Big Cloud native FaaS Lebowski QAware GmbH
 
Docker for developers on mac and windows
Docker for developers on mac and windowsDocker for developers on mac and windows
Docker for developers on mac and windowsDocker, Inc.
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph Ceph Community
 
Hadoop ha system admin
Hadoop ha system adminHadoop ha system admin
Hadoop ha system adminTrieu Dao Minh
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HAHortonworks
 
Docker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker, Inc.
 
DockerCon 18 docker storage
DockerCon 18 docker storageDockerCon 18 docker storage
DockerCon 18 docker storageDaniel Finneran
 

Similar to Apache HDFS High Availability (20)

State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Setting High Availability in Hadoop Cluster
Setting High Availability in Hadoop ClusterSetting High Availability in Hadoop Cluster
Setting High Availability in Hadoop Cluster
 
HA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and KeepalivedHA Deployment Architecture with HAProxy and Keepalived
HA Deployment Architecture with HAProxy and Keepalived
 
HDFS presented by VIJAY
HDFS presented by VIJAYHDFS presented by VIJAY
HDFS presented by VIJAY
 
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
Changes Expected in Hadoop 3 | Getting to Know Hadoop 3 Alpha | Upcoming Hado...
 
Scaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesScaleable PHP Applications in Kubernetes
Scaleable PHP Applications in Kubernetes
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop Administration
 
The Big Cloud native FaaS Lebowski
The Big Cloud native FaaS Lebowski The Big Cloud native FaaS Lebowski
The Big Cloud native FaaS Lebowski
 
Docker for developers on mac and windows
Docker for developers on mac and windowsDocker for developers on mac and windows
Docker for developers on mac and windows
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
Ceph Day Shanghai - Hyper Converged PLCloud with Ceph
 
Hadoop ha system admin
Hadoop ha system adminHadoop ha system admin
Hadoop ha system admin
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Ambari Meetup: NameNode HA
Ambari Meetup: NameNode HAAmbari Meetup: NameNode HA
Ambari Meetup: NameNode HA
 
Docker storage designing a platform for persistent data
Docker storage designing a platform for persistent dataDocker storage designing a platform for persistent data
Docker storage designing a platform for persistent data
 
DockerCon 18 docker storage
DockerCon 18 docker storageDockerCon 18 docker storage
DockerCon 18 docker storage
 

Recently uploaded

Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptxCCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptxdhiyaneswaranv1
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Optimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in LogisticsOptimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in LogisticsThinkInnovation
 
Rock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxRock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxFinatron037
 

Recently uploaded (16)

Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptxCCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
CCS336-Cloud-Services-Management-Lecture-Notes-1.pptx
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Optimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in LogisticsOptimal Decision Making - Cost Reduction in Logistics
Optimal Decision Making - Cost Reduction in Logistics
 
Rock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptxRock Songs common codes and conventions.pptx
Rock Songs common codes and conventions.pptx
 

Apache HDFS High Availability

  • 1. HUAWEI TECHNOLOGIES CO., LTD. www.huawei.com Namenode High Availability Uma Maheswara Rao G maheswara@huawei.com umamahesh@apache.org
  • 2. HUAWEI TECHNOLOGIES CO., LTD. Page 2 Who am I R&D Tech Lead in Huawei Apache Hadoop Committer at ASF Active Contributor to Apache BookKeeper at ASF Main development focus on HDFS NameNode HA
  • 3. HUAWEI TECHNOLOGIES CO., LTD. Page 3 Huawei Hadoop R&D – In a Glance • Secondary Index in HBase • HDFS NN HA (Hadoop-2) • Bookkeeper as shared storage for NN HA (Hadoop-2) • HDFS NN HA (Hadoop-1) • MapReduce ResourceManager HA (Hadoop-2 / YARN) • MapReduce JobTracker HA (Hadoop-1) • Hive HA Hadoop Development • Raised over 650 defects since Jan’11 • Fixed over 500 defects since Jan’11, and contributed back to community. Stabilization NameNode HA
  • 4. HUAWEI TECHNOLOGIES CO., LTD. Page 4 In 2011, we implemented HA in Hadoop 0.20.1, based on Backup Namenode(BNN) and ZooKeeper at Huawei  Intelligent clients - find active NN from configured NNs  Streaming edits to the BNN  Sending block reports to both active NN and BNN  BNN does periodic checkpoints  ZK based leader election Achieved hot standby and Automatic Switch However, BNN based solution did not address double failure scenario Namenode High Availability NameNode HA
  • 5. HUAWEI TECHNOLOGIES CO., LTD. Page 5 What is Double Failure here? Active NN Backup NN Active NN Start - Active NN Checkpoint – t1 Checkpoint – t2 NN-1 NN-2 NN-1 NN-2 X - Operation logs in NN1 and could not stream to BNN (NN2) Not aware of X - Operation logs from NN1. DataLoss!!!  NN1 is active and NN2 is Backup node  NN1 failed to stream edits to the Backup node and removed the stream  NN1 received X- operation edit logs, where NN2 not aware of them  Now NN1 crashed, NN2 becomes active  NN2 continues to work but not aware of the edits after checkpoint:t1 NameNode HA
  • 6. HUAWEI TECHNOLOGIES CO., LTD. Page 6 In 2011 - 2012, Community started discussion on Shared storage approach  HDFS-1623 : Discussed double failure issue  Store the edit logs at a common place and share to both the Namenodes Huawei collaborated with community in HDFS-1623 development Community discussions on double failure Constraints with the HDFS-1623: The shared storage is an SPOF. A highly available shared storage is required. NameNode HA
  • 7. HUAWEI TECHNOLOGIES CO., LTD. Page 7Page 7 NameNode HA Architecture Both NameNodes start as standby and try loading the edits Namenode Standby Failover Controller Namenode Active Failover Controller Shared storage DN 1 DN 2 DN n ZK 1 ZK 2 ZK n Cmd Monitor NN health Namenode Standby Write Block reports Read Read Monitor NN health New process FailOver Controller (ZKFC) responsible for monitoring and failover On ZK based leader election ZKFC issue command to its local NN to become Active or Standby Only Active NN will write to shared storage, Standby NN will read from it. All DataNodes will send block reports to both Namenodes to achieve hot Standby. NameNode HA
  • 8. HUAWEI TECHNOLOGIES CO., LTD. Page 8 Shared storage options NFS: May not be a better fit for many deployments as NFS is an external device, costly, less control on timeouts etc. QuorumJournalManager(QJM): Newly implemented in HDFS. It did not come out from any of the Apache releases yet. BoookKeeperJournalManager(BKJM): Integrated BookKeeper based shared storage to HDFS. Available in Hadoop-2 onwards. NameNode HA
  • 9. HUAWEI TECHNOLOGIES CO., LTD. Page 9  BookKeeper is mainly inspired by Namenode. Can work as shared storage with HA Namenode  Can Support Federated Namenodes  Dynamic scaling up/down of storage nodes(BK) – Useful when HBase, Federated NN also uses it.  Multiple disk support - Reduces the disk seeks  Round Robin reads from quorum - distributes the load  HDFS-234 - BookKeeper based Journal manager - Available in Hadoop-2 releases and trunk  Quorum based commits in trunk – Parallel writes to write-quorum Bookies and wait for ack-quorum (see BOOKKEEPER-208) NameNode HA Motivations to use BookKeeper as shared storage Contd..
  • 10. HUAWEI TECHNOLOGIES CO., LTD. Page 10 Motivations to use BookKeeper as shared storage  BookKeeper is in production use at Yahoo! for guaranteed delivery of log messages to Hedwig Servers  Code was there for more than a year in Apache and has active contributors and committers list NameNode HA
  • 11. HUAWEI TECHNOLOGIES CO., LTD. Page 11 BookKeeper at glance Bookie: Storage node Ledger: log file Ensemble: group of bookies storing a ledger Writes to quorums of Bookies Parallel writes to quorums Wait for ack quorum (in BK 4.2) Reads from the same quorum NameNode HA Throughput: max latency of ack quorum bookies response
  • 12. HUAWEI TECHNOLOGIES CO., LTD. Page 12Page 12Page 12 NN HA Architecture with BK as shared storage NameNode HA Failover Controller Failover Controller Cmd Monitor NN health Write Block reports Read Monitor NN health BK 1 BK 2 BK n Namenode Active B K J M Namenode Standby B K J M BookKeeper Journal Manager (BKJM) is NameNode plugin implementation, involves BK client to read/write to/from BK cluster ZK 1 ZK 2 ZK n DN 1 DN 2 DN n
  • 13. HUAWEI TECHNOLOGIES CO., LTD. Page 13Page 13 NameNode HA Depends on ZooKeeper for leader election Monitors the health of Namenodes Invoke the RPC calls to NN to switch the state when leader election happens Perform leader election when health monitor find NN is bad Fencing: ZKFC will kill the remote NN before invoking switch commands to ensure single Active NN at any time. ZooKeeper Failover Controller (ZKFC) Namenode Monitor health Transiti on to Active/ Standby Namenode Monitor health Transiti on to Active/ Standby Fencing Failover Controlle r Failover Controller ZK 1 ZK 2 ZK n
  • 14. HUAWEI TECHNOLOGIES CO., LTD. Page 14 What if remote node is not accessible to kill? • Requires to write a fence method with shared storage. Note: Shared storage should support fencing for this purpose How BK based shared storage handles fencing? • BookKeeper allows single writer for a ledger “Eliminates the need of strict ZKFC level fencing” • BKJM solves the issue of concurrent ledger creations by two Active Namenodes - see HDFS-3452 Special fencing scenarios: NameNode HA
  • 15. HUAWEI TECHNOLOGIES CO., LTD. Page 15 Configurations  Namenode side: <property> <name>dfs.namenode.shared.edits.dir</name> <value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/haedit</value> </property> <property> <name>dfs.namenode.edits.journal-plugin.bookkeeper</name> <value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value> </property>  BookKeeper side: zkServers = zk1:2181,zk2:2181,zk3:2181 Bookies can start with remaining default configurations NameNode HA
  • 16. HUAWEI TECHNOLOGIES CO., LTD. Page 16 NameNode HA Other Enhancements/Work In Progress BOOKKEEPER-237 - Automatic recovery on Bookie failures. Importance – To avoid the data-loss when multiple Bookie crashes Entry read latency improvement with (round-robin + speculative) reads with less timeout - Avoid latencies in switch Already developed support for shared storage upgrades Further improvements and work progress is in HDFS-3399 Take a look at BookKeeper community work https://issues.apache.org/jira/browse/BOOKKEEPER
  • 17. HUAWEI TECHNOLOGIES CO., LTD. BookKeeper Auto Recovery (BOOKKEEPER-237)  Autorecovery is a new module in BookKeeper  Re-replicates the ledgers automatically when there are Bookie failures Watch on Available Bookies Detects the failed Bookie ledgers Add that ledgers ids in ZK Auditor: NameNode HA
  • 18. HUAWEI TECHNOLOGIES CO., LTD. Monitor for the ledger ids published by Auditor Scan and find the missed fragments from the ledger Initiates the re-replication Clean up the re-replicated ledger ids from ZK ReplicationWorker: NameNode HA BookKeeper Auto Recovery (BOOKKEEPER-237)
  • 19. HUAWEI TECHNOLOGIES CO., LTD. Testing Tested using HA cluster with HBase, MR, Hive, Huawei OM 500+ automated integration HA tests running in daily CI Verified the fault tolerant tests with BK recovery feature – see the list in HDFS-3399 Periodic kill node scenarios to verify the switch with BK – Did not see any data-loss In our test cluster, we did not see significant degradations and expected linear degradations when we increase quorum NameNode HA
  • 20. HUAWEI TECHNOLOGIES CO., LTD. Page 20 Any queries? NameNode HA