HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Namenode High Availability
Uma Maheswara Rao G
maheswara@huawei.com
umamahesh@apache.org
Who am I
R&D Tech Lead in Huawei
Apache Hadoop Committer at ASF
Active Contributor to Apache BookKeeper at ASF
Main development focus on HDFS
NameNode HA
Huawei Hadoop R&D – At a Glance
Hadoop Development
• Secondary Index in HBase
• HDFS NN HA (Hadoop-2)
• BookKeeper as shared storage for NN HA (Hadoop-2)
• HDFS NN HA (Hadoop-1)
• MapReduce ResourceManager HA (Hadoop-2 / YARN)
• MapReduce JobTracker HA (Hadoop-1)
• Hive HA
Stabilization
• Raised over 650 defects since Jan’11
• Fixed over 500 defects since Jan’11 and contributed the fixes back to the community
Namenode High Availability
In 2011, at Huawei, we implemented HA in Hadoop 0.20.1, based on the Backup Namenode (BNN) and ZooKeeper:
• Intelligent clients: find the active NN among the configured NNs
• Streaming edits to the BNN
• Sending block reports to both the active NN and the BNN
• BNN does periodic checkpoints
• ZK-based leader election
This achieved hot standby and automatic switchover.
However, the BNN-based solution did not address the double-failure scenario.
What is Double Failure here?
[Figure: timelines of NN-1 (Active NN) and NN-2 (Backup NN) with checkpoints t1 and t2. NN-1 holds X operation logs that it could not stream to the BNN (NN-2). NN-2 then starts as Active NN, not aware of the X operation logs from NN-1: data loss.]
• NN1 is active and NN2 is the Backup node.
• NN1 fails to stream edits to the Backup node and removes the stream.
• NN1 then logs the edits of X further operations, which NN2 is not aware of.
• NN1 crashes and NN2 becomes active.
• NN2 continues to work but is not aware of the edits made after checkpoint t1.
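The sequence above can be replayed with a toy model. This is only a sketch: the function name and the sample operations are made up for illustration, and each node is reduced to a plain list of edits.

```python
def simulate_double_failure(ops_before_break, ops_after_break):
    """Return (edits known to the backup, edits lost on failover)."""
    active_edits = []   # edit log held by NN1 (active)
    backup_edits = []   # edits successfully streamed to NN2 (backup)

    # Normal operation: every edit is logged locally and streamed to the backup.
    for op in ops_before_break:
        active_edits.append(op)
        backup_edits.append(op)

    # First failure: the stream to the backup breaks and NN1 drops it,
    # but NN1 keeps serving requests and logging edits on its own.
    for op in ops_after_break:
        active_edits.append(op)

    # Second failure: NN1 crashes and NN2 takes over with what it has.
    lost = [op for op in active_edits if op not in backup_edits]
    return backup_edits, lost


survivor_log, lost_edits = simulate_double_failure(
    ["mkdir /a", "create /a/f1"],
    ["create /a/f2", "delete /a/f1"],
)
print("NN2 knows:", survivor_log)
print("Lost:", lost_edits)
```

Everything logged after the stream broke exists only on NN1, so it disappears when NN1 crashes.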
Community discussions on double failure
In 2011-2012, the community started discussing a shared storage approach:
• HDFS-1623: discussed the double-failure issue
• Store the edit logs in a common place shared by both Namenodes
Huawei collaborated with the community on HDFS-1623 development.
Constraint of HDFS-1623: the shared storage is a SPOF, so a highly available shared storage is required.
NameNode HA Architecture
[Figure: an Active and a Standby Namenode, each with a local Failover Controller that monitors NN health and sends commands to it. The controllers use a ZooKeeper ensemble (ZK 1 to ZK n). The Active NN writes to shared storage and the Standby NN reads from it. DataNodes (DN 1 to DN n) send block reports to both Namenodes.]
Both NameNodes start as standby and try loading the edits.
A new process, the FailOver Controller (ZKFC), is responsible for monitoring and failover.
Based on the ZK leader election, each ZKFC issues a command to its local NN to become Active or Standby.
Only the Active NN writes to shared storage; the Standby NN reads from it.
All DataNodes send block reports to both Namenodes to achieve hot standby.
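The write and read roles around the shared storage can be sketched with a toy model. All class and method names here are hypothetical, and the shared storage is reduced to an in-memory list; the point is only that the standby catches up from the shared log before it becomes active.

```python
class SharedEditLog:
    """Toy shared storage: an append-only list of (txid, op) records."""

    def __init__(self):
        self.records = []

    def append(self, op):
        txid = len(self.records) + 1
        self.records.append((txid, op))
        return txid

    def read_from(self, txid):
        """Return all records with a transaction id greater than txid."""
        return [r for r in self.records if r[0] > txid]


class ToyNameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.state = "standby"      # both nodes start as standby
        self.last_applied = 0
        self.namespace = []

    def tail_edits(self):
        """Standby behaviour: read and apply new edits from shared storage."""
        for txid, op in self.log.read_from(self.last_applied):
            self.namespace.append(op)
            self.last_applied = txid

    def transition_to_active(self):
        self.tail_edits()           # catch up before serving writes
        self.state = "active"

    def write(self, op):
        if self.state != "active":
            raise RuntimeError(f"{self.name} is {self.state}; only the active NN writes")
        self.last_applied = self.log.append(op)
        self.namespace.append(op)


log = SharedEditLog()
nn1, nn2 = ToyNameNode("NN1", log), ToyNameNode("NN2", log)
nn1.transition_to_active()
nn1.write("mkdir /a")
nn1.write("create /a/f1")
nn2.tail_edits()                    # standby follows the active
nn1.write("create /a/f2")           # written just before NN1 fails
nn2.transition_to_active()          # failover: NN2 catches up from shared storage
print(nn2.namespace)
```

Unlike the streaming approach, the last edit is not lost: it reached the shared log, so the new active node reads it during the transition.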
Shared storage options
NFS: May not be a good fit for many deployments, as NFS is an external device, is costly, and offers less control over timeouts, etc.
QuorumJournalManager (QJM): Newly implemented in HDFS. It has not yet shipped in any Apache release.
BookKeeperJournalManager (BKJM): Integrates BookKeeper-based shared storage into HDFS. Available from Hadoop-2 onwards.
Motivations to use BookKeeper as shared storage
• BookKeeper is mainly inspired by the Namenode and can work as shared storage for an HA Namenode
• Can support Federated Namenodes
• Dynamic scaling up/down of storage nodes (BK): useful when HBase and Federated NNs also use it
• Multiple disk support: reduces disk seeks
• Round-robin reads from the quorum: distributes the load
• HDFS-234: BookKeeper-based Journal Manager, available in Hadoop-2 releases and trunk
• Quorum-based commits in trunk: parallel writes to the write-quorum Bookies, then wait for the ack quorum (see BOOKKEEPER-208)
Motivations to use BookKeeper as shared storage (contd.)
• BookKeeper is in production use at Yahoo! for guaranteed delivery of log messages to Hedwig servers
• The code has been in Apache for more than a year and has an active list of contributors and committers
BookKeeper at a glance
Bookie: storage node
Ledger: log file
Ensemble: group of bookies storing a ledger
Writes go to quorums of Bookies
Parallel writes to the quorum
Wait for the ack quorum (in BK 4.2)
Reads from the same quorum
Throughput: bounded by the slowest response among the ack-quorum bookies
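A small sketch of why the ack quorum sets the pace, assuming all write-quorum bookies are written in parallel and the entry counts as committed once the ack-quorum number of responses has arrived. The function name and the latency figures are hypothetical.

```python
def commit_latency(bookie_latencies_ms, ack_quorum):
    """Latency until an entry is acknowledged by `ack_quorum` bookies.

    The write goes to every bookie in the write quorum in parallel, so the
    entry commits when the ack_quorum-th fastest response arrives.
    """
    if not 1 <= ack_quorum <= len(bookie_latencies_ms):
        raise ValueError("ack quorum must be between 1 and the write quorum size")
    return sorted(bookie_latencies_ms)[ack_quorum - 1]


write_quorum_latencies = [4.0, 6.5, 30.0]   # hypothetical per-bookie latencies in ms
fast = commit_latency(write_quorum_latencies, ack_quorum=2)   # waits for the 2 fastest
slow = commit_latency(write_quorum_latencies, ack_quorum=3)   # waits for all 3
print(fast, slow)
```

With an ack quorum smaller than the write quorum, one slow bookie no longer delays every write.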
NN HA Architecture with BK as shared storage
[Figure: an Active and a Standby Namenode, each embedding BKJM and paired with a Failover Controller that monitors NN health and sends commands to it. The Active NN writes to the BookKeeper cluster (BK 1 to BK n) and the Standby NN reads from it. DataNodes (DN 1 to DN n) send block reports. The Failover Controllers use a ZooKeeper ensemble (ZK 1 to ZK n).]
BookKeeper Journal Manager (BKJM) is a NameNode plugin implementation. It uses the BK client to write to and read from the BK cluster.
ZooKeeper Failover Controller (ZKFC)
Depends on ZooKeeper for leader election
Monitors the health of the Namenodes
Invokes RPC calls on the NN to switch its state when a leader election happens
Performs leader election when the health monitor finds that the NN is bad
Fencing: ZKFC kills the remote NN before invoking switch commands, to ensure a single Active NN at any time.
[Figure: two Namenodes, each with a Failover Controller that monitors its health and transitions it to Active/Standby. The controllers coordinate through ZooKeeper (ZK 1 to ZK n), and a controller can fence the other Namenode.]
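The controller's responsibilities can be sketched as a single monitoring loop. This is a toy model, not the ZKFC implementation: the election is a plain in-memory lock standing in for ZooKeeper, fencing just flips a state flag, and every name is hypothetical.

```python
class ToyElection:
    """Stand-in for a ZooKeeper lock: at most one holder at a time."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class ToyNameNode:
    def __init__(self, name):
        self.name = name
        self.state = "standby"
        self.healthy = True


class ToyFailoverController:
    def __init__(self, local_nn, remote_nn, election):
        self.local, self.remote, self.election = local_nn, remote_nn, election

    def fence_remote(self):
        # Fencing: make sure the other NN can no longer act as active.
        self.remote.state = "fenced"

    def tick(self):
        """One monitoring round: check health, then join or leave the election."""
        if not self.local.healthy:
            # Health monitor reports a bad NN: give up the election lock.
            self.election.release(self.local.name)
            return
        won = self.election.try_acquire(self.local.name)
        if won and self.local.state != "active":
            if self.remote.state == "active":
                self.fence_remote()          # ensure a single active NN
            self.local.state = "active"


election = ToyElection()
nn1, nn2 = ToyNameNode("NN1"), ToyNameNode("NN2")
zkfc1 = ToyFailoverController(nn1, nn2, election)
zkfc2 = ToyFailoverController(nn2, nn1, election)

zkfc1.tick(); zkfc2.tick()      # NN1 wins the election, NN2 stays standby
nn1.healthy = False             # NN1 hangs while still marked active
zkfc1.tick(); zkfc2.tick()      # NN1's controller quits, NN2's controller takes over
print(nn1.state, nn2.state)
```

The new leader fences the old active node before promoting its own, so at no point are two nodes active.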
Special fencing scenarios
What if the remote node is not accessible to kill?
• A fence method has to be written against the shared storage.
Note: the shared storage should support fencing for this purpose.
How does BK-based shared storage handle fencing?
• BookKeeper allows a single writer per ledger, which “eliminates the need of strict ZKFC level fencing”.
• BKJM solves the issue of concurrent ledger creation by two Active Namenodes (see HDFS-3452).
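The single-writer guarantee can be illustrated with a toy ledger. The sketch is loosely modelled on how a ledger is fenced when another client opens it for recovery; the class and method names are hypothetical and are not the BookKeeper client API.

```python
class LedgerFencedError(Exception):
    """Raised when a writer appends to a ledger that has been fenced."""


class ToyLedger:
    """Toy single-writer log that can be fenced by a recovering reader."""

    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self.entries = []
        self.fenced = False

    def add_entry(self, data):
        if self.fenced:
            raise LedgerFencedError(f"ledger {self.ledger_id} is fenced")
        self.entries.append(data)

    def open_with_recovery(self):
        """Fence the ledger so its old writer is cut off, then return its entries."""
        self.fenced = True
        return list(self.entries)


ledger1 = ToyLedger(1)
ledger1.add_entry("mkdir /a")              # NN1, the current active, writes edits
ledger1.add_entry("create /a/f1")

recovered = ledger1.open_with_recovery()   # NN2 takes over and fences ledger 1
ledger2 = ToyLedger(2)                     # NN2 writes new edits to its own ledger
ledger2.add_entry("create /a/f2")

try:
    ledger1.add_entry("delete /a/f1")      # NN1 still believes it is active
    stale_write_rejected = False
except LedgerFencedError:
    stale_write_rejected = True
print(recovered, stale_write_rejected)
```

Because the storage itself rejects the stale writer, the old Namenode cannot corrupt the log even if it could not be killed.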
Configurations
• Namenode side:
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/haedit</value>
</property>
<property>
  <name>dfs.namenode.edits.journal-plugin.bookkeeper</name>
  <value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value>
</property>
• BookKeeper side:
zkServers = zk1:2181,zk2:2181,zk3:2181
Bookies can start with the defaults for the remaining settings.
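Note that the Namenode URI separates the ZooKeeper servers with semicolons, while the bookie setting uses commas. The helper below is a hypothetical sketch of how the URI can be read, assuming it has the form shown above; it is not BKJM's own parsing code.

```python
def parse_bookkeeper_uri(uri):
    """Split a bookkeeper:// URI into its ZooKeeper servers and znode path."""
    prefix = "bookkeeper://"
    if not uri.startswith(prefix):
        raise ValueError("expected a bookkeeper:// URI")
    hosts_part, _, path = uri[len(prefix):].partition("/")
    zk_servers = [h for h in hosts_part.split(";") if h]
    return zk_servers, "/" + path


servers, znode = parse_bookkeeper_uri("bookkeeper://zk1:2181;zk2:2181;zk3:2181/haedit")
bookie_setting = "zkServers = " + ",".join(servers)
print(servers, znode)
print(bookie_setting)
```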
Other Enhancements/Work In Progress
BOOKKEEPER-237: automatic recovery on Bookie failures.
Importance: avoids data loss when multiple Bookies crash.
Entry read latency improvement using round-robin plus speculative reads with a shorter timeout, to avoid latency during the switch.
Support for shared storage upgrades is already developed.
Further improvements and work in progress are tracked in HDFS-3399.
Take a look at the BookKeeper community work:
https://issues.apache.org/jira/browse/BOOKKEEPER
BookKeeper Auto Recovery (BOOKKEEPER-237)
• Autorecovery is a new module in BookKeeper
• Re-replicates the ledgers automatically when there are Bookie failures
Auditor:
Watches the available Bookies
Detects the ledgers of a failed Bookie
Adds those ledger ids to ZK
BookKeeper Auto Recovery (BOOKKEEPER-237) (contd.)
ReplicationWorker:
Monitors the ledger ids published by the Auditor
Scans the ledger and finds the missing fragments
Initiates the re-replication
Cleans up the re-replicated ledger ids from ZK
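The division of work between the Auditor and the ReplicationWorker can be sketched as two functions. This is a simplified toy model: ledgers are tracked as whole replicas rather than fragments, a plain list stands in for the ledger ids kept in ZK, and all names are hypothetical.

```python
def audit(ledger_replicas, live_bookies):
    """Auditor: return ids of ledgers that have a replica on a failed bookie."""
    return sorted(
        ledger_id
        for ledger_id, bookies in ledger_replicas.items()
        if any(b not in live_bookies for b in bookies)
    )


def replicate(ledger_replicas, live_bookies, under_replicated):
    """ReplicationWorker: move lost replicas to live bookies, then clear the ids."""
    for ledger_id in list(under_replicated):
        bookies = ledger_replicas[ledger_id]
        survivors = [b for b in bookies if b in live_bookies]
        spares = [b for b in sorted(live_bookies) if b not in survivors]
        missing = len(bookies) - len(survivors)
        if len(spares) < missing:
            continue    # not enough live bookies yet; leave the id for a retry
        ledger_replicas[ledger_id] = survivors + spares[:missing]
        under_replicated.remove(ledger_id)


ledger_replicas = {1: ["bk1", "bk2"], 2: ["bk2", "bk3"], 3: ["bk3", "bk4"]}
live_bookies = {"bk1", "bk3", "bk4"}             # bk2 has failed

pending = audit(ledger_replicas, live_bookies)   # stand-in for the ids kept in ZK
print("under-replicated:", pending)
replicate(ledger_replicas, live_bookies, pending)
print(ledger_replicas, pending)
```

After the worker runs, no ledger references the failed bookie and the pending list is empty.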
Testing
Tested using an HA cluster with HBase, MR, Hive, and Huawei OM.
500+ automated HA integration tests run in daily CI.
Verified the fault-tolerance tests with the BK recovery feature (see the list in HDFS-3399).
Periodic kill-node scenarios to verify the switch with BK: did not see any data loss.
In our test cluster we did not see significant degradation, only the expected linear degradation as the quorum size increases.
Any queries?
Thank you

Apache HDFS High Availability

  • 1. Namenode High Availability
      Uma Maheswara Rao G (maheswara@huawei.com, umamahesh@apache.org)
      HUAWEI TECHNOLOGIES CO., LTD., www.huawei.com
  • 2. Who am I
      • R&D Tech Lead at Huawei
      • Apache Hadoop committer at the ASF
      • Active contributor to Apache BookKeeper at the ASF
      • Main development focus on HDFS
  • 3. Huawei Hadoop R&D at a Glance
      Hadoop development:
      • Secondary Index in HBase
      • HDFS NN HA (Hadoop-2)
      • BookKeeper as shared storage for NN HA (Hadoop-2)
      • HDFS NN HA (Hadoop-1)
      • MapReduce ResourceManager HA (Hadoop-2 / YARN)
      • MapReduce JobTracker HA (Hadoop-1)
      • Hive HA
      Stabilization:
      • Raised over 650 defects since Jan '11
      • Fixed over 500 defects since Jan '11 and contributed the fixes back to the community
  • 4. Namenode High Availability
      In 2011, we implemented HA in Hadoop 0.20.1 at Huawei, based on the Backup Namenode (BNN) and ZooKeeper:
      • Intelligent clients find the active NN among the configured NNs
      • Edits are streamed to the BNN
      • Block reports are sent to both the active NN and the BNN
      • The BNN does periodic checkpoints
      • ZK-based leader election
      This achieved hot standby and automatic switchover. However, the BNN-based solution did not address the double failure scenario.
  • 5. What is Double Failure here?
      [Diagram: timeline of NN-1 (active) and NN-2 (backup) with checkpoints t1 and t2. NN-1 holds X operation logs that it could not stream to the BNN (NN-2). NN-2 then starts as active, unaware of those X operation logs: data loss.]
      • NN1 is active and NN2 is the Backup node.
      • NN1 fails to stream edits to the Backup node and removes the stream.
      • NN1 receives X operations whose edit logs NN2 is not aware of.
      • NN1 then crashes and NN2 becomes active.
      • NN2 continues to work but is not aware of the edits made after checkpoint t1.
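
The double failure sequence on this slide can be replayed with a small toy model. This is only an illustrative sketch, not Hadoop code: `NameNode` and `run_double_failure` are hypothetical names, and edit logs are reduced to plain lists of strings.

```python
class NameNode:
    """Toy NameNode that keeps an ordered list of edit-log operations."""

    def __init__(self, name):
        self.name = name
        self.edits = []


def run_double_failure(ops_before_break, ops_after_break):
    """Replay the scenario and return (edits NN2 serves, edits that were lost)."""
    nn1, nn2 = NameNode("NN1"), NameNode("NN2")

    # Normal operation: every edit on NN1 is also streamed to the backup.
    for op in ops_before_break:
        nn1.edits.append(op)
        nn2.edits.append(op)

    # First failure: streaming breaks and NN1 removes the stream,
    # so these operations are recorded on NN1 only.
    for op in ops_after_break:
        nn1.edits.append(op)

    # Second failure: NN1 crashes and NN2 becomes active with what it has.
    lost = [op for op in nn1.edits if op not in nn2.edits]
    return nn2.edits, lost
```

Calling `run_double_failure(["mkdir /a"], ["create /a/f"])` leaves NN2 serving only `mkdir /a`, with `create /a/f` reported as lost, which is the data loss the slide describes.
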
  • 6. Community discussions on double failure
      In 2011-2012, the community started discussing a shared storage approach:
      • HDFS-1623: discussed the double failure issue
      • Store the edit logs in a common place and share them with both Namenodes
      Huawei collaborated with the community on HDFS-1623 development.
      Constraint with HDFS-1623: the shared storage is an SPOF, so a highly available shared storage is required.
  • 7. NameNode HA Architecture
      [Diagram: an Active Namenode and a Standby Namenode, each with a local Failover Controller that monitors NN health and sends it commands; the controllers coordinate through ZK 1 .. ZK n; the Active NN writes to shared storage and the Standby NN reads from it; DN 1 .. DN n send block reports to both Namenodes.]
      • Both NameNodes start as standby and try loading the edits.
      • A new process, the Failover Controller (ZKFC), is responsible for monitoring and failover.
      • Based on ZK leader election, the ZKFC issues a command to its local NN to become Active or Standby.
      • Only the Active NN writes to shared storage; the Standby NN reads from it.
      • All DataNodes send block reports to both Namenodes to achieve hot standby.
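
A minimal sketch of the "only the Active writes, the Standby reads" rule, under the assumption that the shared storage can be modelled as an append-only list. `SharedEditLog` and `ToyNameNode` are hypothetical names for illustration and do not correspond to HDFS classes.

```python
class SharedEditLog:
    """Stand-in for the shared storage: an append-only list of edits."""

    def __init__(self):
        self.entries = []


class ToyNameNode:
    def __init__(self, name, shared_log):
        self.name = name
        self.shared_log = shared_log
        self.state = "standby"  # both NameNodes start as standby
        self.applied = 0        # how many shared edits were applied locally
        self.namespace = []

    def catch_up(self):
        """Standby behaviour: read edits not yet applied from shared storage."""
        new_edits = self.shared_log.entries[self.applied:]
        self.namespace.extend(new_edits)
        self.applied += len(new_edits)

    def transition_to_active(self):
        self.catch_up()  # load all outstanding edits before serving writes
        self.state = "active"

    def transition_to_standby(self):
        self.state = "standby"

    def write_edit(self, op):
        if self.state != "active":
            raise RuntimeError(self.name + " is standby and must not write")
        self.shared_log.entries.append(op)
        self.namespace.append(op)
        self.applied += 1
```

In this model a standby that calls `catch_up()` regularly stays close to the active's state, so a later `transition_to_active()` has little left to load.
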
  • 8. Shared storage options
      • NFS: may not be a good fit for many deployments, because NFS is an external device, is costly, and gives less control over timeouts.
      • QuorumJournalManager (QJM): newly implemented in HDFS. It has not yet shipped in any Apache release.
      • BookKeeperJournalManager (BKJM): integrates BookKeeper-based shared storage with HDFS. Available from Hadoop-2 onwards.
  • 9. Motivations to use BookKeeper as shared storage (contd.)
      • BookKeeper is mainly inspired by the Namenode and can work as shared storage for an HA Namenode
      • Can support federated Namenodes
      • Dynamic scaling up/down of storage nodes (BK): useful when HBase and a federated NN also use it
      • Multiple disk support: reduces disk seeks
      • Round-robin reads from the quorum: distributes the load
      • HDFS-234: BookKeeper-based journal manager, available in Hadoop-2 releases and trunk
      • Quorum-based commits in trunk: parallel writes to the write-quorum Bookies, then wait for the ack quorum (see BOOKKEEPER-208)
  • 10. Motivations to use BookKeeper as shared storage
      • BookKeeper is in production use at Yahoo! for guaranteed delivery of log messages to Hedwig servers
      • The code has been in Apache for more than a year and has an active list of contributors and committers
  • 11. BookKeeper at a glance
      • Bookie: storage node
      • Ledger: log file
      • Ensemble: group of bookies storing a ledger
      • Writes go to quorums of Bookies: parallel writes to the quorum, then wait for the ack quorum (in BK 4.2)
      • Reads come from the same quorum
      • Throughput: determined by the maximum response latency among the ack-quorum bookies
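
One way to read the throughput note: with parallel writes, an entry is acknowledged as soon as the ack-quorum number of bookies has responded, so its latency is that of the slowest bookie inside the ack quorum. The sketch below computes this under that assumption; `ack_latency` is a hypothetical helper, not part of the BookKeeper client.

```python
def ack_latency(bookie_latencies_ms, ack_quorum):
    """Time until `ack_quorum` bookies have answered a parallel write.

    bookie_latencies_ms: response latency of each bookie in the write quorum.
    ack_quorum: number of acknowledgements the writer waits for.
    """
    if not 1 <= ack_quorum <= len(bookie_latencies_ms):
        raise ValueError("ack quorum must be between 1 and the write quorum size")
    # The k-th fastest response completes the ack quorum.
    return sorted(bookie_latencies_ms)[ack_quorum - 1]
```

For example, with bookie latencies of 5 ms, 40 ms and 8 ms, waiting for 2 acks takes 8 ms, while waiting for all 3 takes 40 ms, so one slow bookie only hurts when it is needed to complete the ack quorum.
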
  • 12. NN HA Architecture with BK as shared storage
      [Diagram: Active and Standby Namenodes, each embedding BKJM and paired with a Failover Controller that monitors NN health and sends commands; the Active NN writes to BK 1 .. BK n and the Standby NN reads from them; ZK 1 .. ZK n coordinate; DN 1 .. DN n send block reports.]
      BookKeeper Journal Manager (BKJM) is a NameNode plugin implementation that uses the BK client to read from and write to the BK cluster.
  • 13. ZooKeeper Failover Controller (ZKFC)
      [Diagram: two Failover Controllers, each monitoring the health of its Namenode and transitioning it to Active/Standby, with a fencing link to the remote Namenode; both controllers connect to ZK 1 .. ZK n.]
      • Depends on ZooKeeper for leader election
      • Monitors the health of the Namenodes
      • Invokes RPC calls on the NN to switch its state when a leader election happens
      • Performs leader election when the health monitor finds that the NN is bad
      • Fencing: the ZKFC kills the remote NN before invoking switch commands, to ensure a single Active NN at any time
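
The ZKFC responsibilities listed above can be sketched as one decision function. This is a simplified illustration, not the ZKFC implementation: `fail_over` is a hypothetical name, ZooKeeper leader election is replaced by "first healthy standby wins", and `fence` is a caller-supplied callback.

```python
def fail_over(states, healthy, fence):
    """Return a new name -> state mapping after one failover check.

    states:  dict of NameNode name -> "active" or "standby"
    healthy: dict of NameNode name -> bool, as seen by health monitoring
    fence:   callable(name) -> bool, True once the old active is fenced
    """
    active = next((n for n, s in states.items() if s == "active"), None)
    if active is not None and healthy[active]:
        return dict(states)  # active NN is fine, nothing to switch

    # Stand-in for ZooKeeper leader election: first healthy standby wins.
    candidates = [n for n, s in states.items() if s == "standby" and healthy[n]]
    if not candidates:
        raise RuntimeError("no healthy NameNode to promote")
    winner = candidates[0]

    # Fence the old active before switching, so two actives never coexist.
    if active is not None and not fence(active):
        raise RuntimeError("fencing failed; refusing to promote " + winner)

    new_states = dict(states)
    if active is not None:
        new_states[active] = "standby"
    new_states[winner] = "active"
    return new_states
```

The ordering is the point of the sketch: the state switch happens only after fencing has succeeded.
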
  • 14. Special fencing scenarios
      What if the remote node is not accessible to kill?
      • A fence method must be written against the shared storage. Note: the shared storage should support fencing for this purpose.
      How does BK-based shared storage handle fencing?
      • BookKeeper allows a single writer per ledger, which "eliminates the need of strict ZKFC level fencing"
      • BKJM solves the issue of concurrent ledger creation by two Active Namenodes (see HDFS-3452)
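
The single-writer property can be pictured with a toy ledger. This is a conceptual sketch only: `ToyLedger`, `open_for_recovery` and `LedgerFencedError` are hypothetical names chosen for illustration and are not the BookKeeper client API.

```python
class LedgerFencedError(Exception):
    """Raised when a writer appends to a ledger that has been fenced."""


class ToyLedger:
    def __init__(self):
        self.entries = []
        self.fenced = False

    def add_entry(self, data):
        # Called by the NameNode that believes it is the active writer.
        if self.fenced:
            raise LedgerFencedError("ledger is fenced; this writer must stop")
        self.entries.append(data)

    def open_for_recovery(self):
        # Called by the NameNode taking over: from now on the previous
        # writer can no longer append, even if it could not be killed.
        self.fenced = True
        return list(self.entries)
```

In this model, an old active that is unreachable but still running gets `LedgerFencedError` on its next write, so it cannot diverge from the new active.
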
  • 15. Configurations
      • Namenode side:
          <property>
            <name>dfs.namenode.shared.edits.dir</name>
            <value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/haedit</value>
          </property>
          <property>
            <name>dfs.namenode.edits.journal-plugin.bookkeeper</name>
            <value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value>
          </property>
      • BookKeeper side:
          zkServers = zk1:2181,zk2:2181,zk3:2181
        Bookies can start with the remaining default configurations.
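
Note that the Namenode-side URI separates the ZooKeeper servers with semicolons, while the BookKeeper-side `zkServers` value uses commas. The sketch below splits the URI from this slide to make the relation explicit; `parse_bk_edits_uri` is a hypothetical helper, and reading the trailing path as a ZooKeeper path is an assumption based on the URI shape.

```python
def parse_bk_edits_uri(uri):
    """Split a bookkeeper:// shared-edits URI into (zk_servers, path)."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme != "bookkeeper":
        raise ValueError("expected a bookkeeper:// URI, got: " + uri)
    hosts, slash, path = rest.partition("/")
    if not hosts or not slash:
        raise ValueError("expected <zk servers>/<path> after the scheme")
    return hosts.split(";"), "/" + path


servers, path = parse_bk_edits_uri(
    "bookkeeper://zk1:2181;zk2:2181;zk3:2181/haedit"
)
# The same ensemble, written the way the BookKeeper side expects it.
zk_servers_value = ",".join(servers)
```

Both sides should point at the same ZooKeeper ensemble, so deriving one value from the other is a simple way to keep them consistent.
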
  • 16. Other Enhancements / Work In Progress
      • BOOKKEEPER-237: automatic recovery on Bookie failures. Importance: avoids data loss when multiple Bookies crash.
      • Entry read latency improvement using round-robin plus speculative reads with a shorter timeout, to avoid latencies during the switch
      • Support for shared storage upgrades is already developed
      • Further improvements and work in progress are tracked in HDFS-3399
      • Take a look at the BookKeeper community work: https://issues.apache.org/jira/browse/BOOKKEEPER
  • 17. BookKeeper Auto Recovery (BOOKKEEPER-237)
      • Autorecovery is a new module in BookKeeper
      • Re-replicates ledgers automatically when there are Bookie failures
      Auditor:
      • Watches the available Bookies
      • Detects the ledgers of a failed Bookie
      • Adds those ledger ids in ZK
  • 18. BookKeeper Auto Recovery (BOOKKEEPER-237)
      ReplicationWorker:
      • Monitors the ledger ids published by the Auditor
      • Scans the ledger and finds the missing fragments
      • Initiates the re-replication
      • Cleans up the re-replicated ledger ids from ZK
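
The Auditor and ReplicationWorker roles can be sketched as two small functions over in-memory data. This is an illustrative model under simplifying assumptions, not the autorecovery module: ledgers are replicated whole rather than per fragment, the list of under-replicated ids stands in for the ids kept in ZK, and `audit` and `re_replicate` are hypothetical names.

```python
def audit(ledger_ensembles, available_bookies):
    """Auditor: ids of ledgers with at least one replica on a failed bookie."""
    alive = set(available_bookies)
    return sorted(
        lid for lid, ensemble in ledger_ensembles.items()
        if not set(ensemble) <= alive
    )


def re_replicate(ledger_ensembles, under_replicated, available_bookies):
    """ReplicationWorker: replace failed bookies, then clear the ledger id."""
    for lid in list(under_replicated):
        ensemble = ledger_ensembles[lid]
        spares = [b for b in available_bookies if b not in ensemble]
        new_ensemble = []
        for bookie in ensemble:
            if bookie in available_bookies:
                new_ensemble.append(bookie)
            elif spares:
                new_ensemble.append(spares.pop(0))  # copy data to a spare
        if len(new_ensemble) == len(ensemble):
            ledger_ensembles[lid] = new_ensemble
            under_replicated.remove(lid)  # clean up the published id
```

A ledger whose failed bookies cannot all be replaced stays in `under_replicated`, so a later pass can retry once more bookies are available.
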
  • 19. Testing
      • Tested using an HA cluster with HBase, MR, Hive, and Huawei OM
      • 500+ automated HA integration tests run in daily CI
      • Verified the fault-tolerance tests with the BK recovery feature (see the list in HDFS-3399)
      • Periodic kill-node scenarios to verify the switch with BK: no data loss was seen
      • In our test cluster we did not see significant degradation; degradation was linear, as expected, when we increased the quorum
  • 20. Any queries?