
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM


Presentation slides from LINE Developer Meetup #68 - Big Data Platform, covering the HDFS major version upgrade and the adoption of Router-based Federation (RBF). Event page: https://line.connpass.com/event/188176/


  1. Upgrading HDFS to 3.3.0 and deploying RBF in production
     2020/09/17, Akira Ajisaka
     LINE Developer Meetup #68 – Big Data Platform
     Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved.
  2. Self introduction
     • Akira Ajisaka (鯵坂 明, Twitter: @ajis_ka)
     • Apache Hadoop PMC member (2016~)
     • Yahoo! JAPAN (2018~)
     (Photo: outdoor bouldering for the first time in Mitake)
  3. Agenda
     • Why and how we upgraded the largest HDFS cluster to 3.3.0
     • Hadoop clusters in Yahoo! JAPAN
     • A short intro to RBF and why we chose it
     • How to upgrade
     • How to split the namespace
     • What we considered and experimented with
     • Many troubles and lessons learned from them
  4. Why and how we upgraded the cluster
  5. Yahoo! JAPAN's largest HDFS cluster
     • 100PB actually used
     • 500+ DataNodes
     • 240M files + directories
     • 290M blocks
     • 400GB NameNode Java heap
     • HDP 2.6.x + patches (as of Dec. 2019)
     Reference: https://www.slideshare.net/techblogyahoo/hadoop-yjtc19-in-shibuya-b2-yjtc
  6. Major existing problems
     • The namespace is too large
       • The NameNode does not scale infinitely due to heavy GC
     • The Hadoop version is too old
       • HDP 2.6 is based on Apache Hadoop 2.7.3
       • 2.7.3 was released 4 years ago
     • We upgraded to HDFS 3.3.0 and used RBF to split the namespace
  7. RBF (Router-based Federation)
     (Diagram: a DFSRouter routes /top, /shp, and /auc to three separate NameNode namespaces, with the StateStore kept in ZooKeeper)
     Note: Kerberos authentication is supported in Hadoop 3.3.0
  8. How to enable RBF without clients' config changes
     (Diagram: before, clients connect to the NameNode at host1:8020; after, the DFSRouter at host1 takes over port 8020, the NameNode on host1 moves to port 8021, and NameNodes on host2 and host3 join, with the StateStore in ZooKeeper)
     Note: We couldn't rolling-upgrade the cluster because of the NN RPC port change
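The port swap above can be sketched in configuration. The host and nameservice names follow the slide; the property values shown are an illustrative assumption, not our exact production config:

```xml
<!-- hdfs-site.xml: the NameNode moves off the well-known client port 8020 -->
<property>
  <name>dfs.namenode.rpc-address.ns.nn1</name>
  <value>host1:8021</value>
</property>

<!-- hdfs-rbf-site.xml: the DFSRouter takes over the old NameNode port,
     so existing clients keep connecting to host1:8020 unchanged -->
<property>
  <name>dfs.federation.router.rpc-address</name>
  <value>host1:8020</value>
</property>
```

Because the NameNode's RPC port itself changes, old and new processes cannot coexist on the same address, which is why a rolling upgrade was not possible here.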
  9. How to split namespaces
     • Calculated the # of files/directories/blocks from the fsimage
     • Calculated the # of RPCs from the audit logs
       • RPCs are classified into two groups (update/read)
     • We had to check the audit logs to ensure that there is no rename operation between namespaces
       • RBF does not support it for now
       • Xiaomi has developed HDFS Federation Rename (HFR): https://issues.apache.org/jira/browse/HDFS-15087 (work in progress)
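The audit-log analysis above can be sketched as follows. This is a minimal illustration, not the tool we actually used; the log layout assumes the default HDFS audit-log format, and the read/update command split is an assumed (incomplete) classification:

```python
import re
from collections import Counter

# Assumed default HDFS audit-log fields (cmd=... src=... dst=...)
AUDIT_RE = re.compile(r"cmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)(?:\s+dst=(?P<dst>\S+))?")
# Assumed classification; a real analysis would enumerate every RPC
UPDATE_CMDS = {"create", "delete", "rename", "mkdirs", "append",
               "setPermission", "setOwner", "setReplication"}

def top_dir(path):
    """/top/a/b -> /top (the candidate namespace boundary)."""
    return "/" + path.lstrip("/").split("/", 1)[0]

def count_rpcs(lines):
    """Count (top-level dir, 'read'|'update') pairs from audit-log lines."""
    counts = Counter()
    for line in lines:
        m = AUDIT_RE.search(line)
        if not m:
            continue
        kind = "update" if m.group("cmd") in UPDATE_CMDS else "read"
        counts[(top_dir(m.group("src")), kind)] += 1
    return counts

def cross_namespace_renames(lines):
    """Renames whose src and dst sit under different top-level dirs.
    These would break after the split, since RBF cannot rename across
    namespaces."""
    hits = []
    for line in lines:
        m = AUDIT_RE.search(line)
        if m and m.group("cmd") == "rename" and m.group("dst") not in (None, "null"):
            if top_dir(m.group("src")) != top_dir(m.group("dst")):
                hits.append((m.group("src"), m.group("dst")))
    return hits
```

Running this over the full audit-log history gives both the per-namespace RPC load and a list of rename patterns that must be eliminated (or rerouted) before the split.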
  10. Split DataNodes or not?
      (Diagram: either split the DataNodes so each namespace gets its own DNs, or leave them unsplit and have every DN register with all the NameNodes)
      We chose splitting DNs because it is simple
  11. Split DataNodes – Pros and Cons
      Pros:
      • Simple
      • Easy to troubleshoot and operate
      • No limitation on the # of namespaces
      • East-west traffic can be controlled easily
      Cons:
      • Need to calculate how many DNs are required for each namespace
      • Possible unbalanced resource usage among namespaces
      • HFR uses hard links for rename, and it assumes non-split DNs
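The first con above, sizing the DN count per namespace, is essentially this arithmetic. The replication factor, per-DN capacity, and utilization headroom below are invented illustrative numbers, not our production figures:

```python
import math

def dns_needed(used_bytes, per_dn_capacity_bytes,
               replication=3, target_utilization=0.7):
    """DataNodes needed so the replicated data for one namespace fits
    below a target disk utilization (headroom for growth/rebalancing)."""
    raw = used_bytes * replication  # physical bytes including replicas
    return math.ceil(raw / (per_dn_capacity_bytes * target_utilization))
```

Capacity alone is not sufficient, though: as the network-bandwidth trouble later in the deck shows, read/write traffic per namespace must be sized as well.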
  12. Check HDFS client-server compatibility
      • We upgraded HDFS only
      • Old (HDP 2.6) clients still exist, so we had to check compatibility
        • We read the ".proto" files and verified it
      • In addition, we upgraded HDFS in the development cluster for end users
      • Wrote a blog post: https://techblog.yahoo.co.jp/entry/20191206786320/ (Japanese and English)
  13. Load-balancing DFSRouters
      • If a client is configured as follows, the client always connects to host1:

        <property>
          <name>dfs.nameservices</name>
          <value>ns</value>
        </property>
        <property>
          <name>dfs.ha.namenodes.ns</name>
          <value>dr1,dr2</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.ns.dr1</name>
          <value>host1:8020</value>
        </property>
        <property>
          <name>dfs.namenode.rpc-address.ns.dr2</name>
          <value>host2:8020</value>
        </property>

      • To avoid this problem, set "dfs.client.failover.random.order" to true
        • This feature is available in Hadoop 2.9.0 and not available in the old clients, so we patched them internally
        • The default value is true in Hadoop 3.4.0+ (HDFS-15350)
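The fix is a one-property client-side addition; a sketch for the clients' hdfs-site.xml, alongside the nameservice config shown on the slide:

```xml
<!-- Try the configured DFSRouters (dr1, dr2) in random order instead of
     always starting with the first one, spreading clients across hosts -->
<property>
  <name>dfs.client.failover.random.order</name>
  <value>true</value>
</property>
```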
  14. Try Java 11
      • Hadoop 3.3.0 supports Java 11 as a runtime
      • We upgraded to Java 11 to improve GC performance
      • We contributed many patches for Java 11 support to the Apache Hadoop community
      • https://www.slideshare.net/techblogyahoo/java11-apache-hadoop-146834504 (Japanese)
  15. Upgrade ZooKeeper to 3.5.x
      • Error log with Hadoop 3.3.0 and ZK 3.4.x:

        (snip)
        Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /zkdtsm-router/ZKDTSMRoot/ZKDTSMSeqNumRoot
            at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
            at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
            at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
            at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
            at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
        (snip)

      • Hadoop 3.3.0 upgraded the Curator version, which depends on ZooKeeper 3.5.x (HADOOP-16579)
      • We rolling-upgraded the ZK cluster before upgrading HDFS
      • The upgrade succeeded without any major problems
  16. Planned schedule
      • 2019.9 Upgrade to trunk in the dev cluster
      • 2020.3 Apache Hadoop 3.3.0 release
      • 2020.3 Upgrade to 3.3.0 in the staging cluster
      • 2020.5 Upgrade to 3.3.0 in production
  17. Actual schedule
      • 2019.9 Upgraded to trunk in the dev cluster (with 1 retry)
      • 2020.7 Apache Hadoop 3.3.0 released
      • 2020.8 Upgraded to 3.3.0 in the staging cluster (with 2 retries)
      • 2020.8 Upgraded to 3.3.0 in production (no retries, but we faced many troubles...)
      • The upgrade was completed remotely
  18. Many troubles
  19. DistCp is slower than expected
      • We used DistCp to move recent data between namespaces after the upgrade, but it didn't finish by the deadline
        • Directory listing of src/dst is serial
        • Increasing the number of Map tasks does not help
      • DistCp always fails if (# of Map tasks) > 200 and the dynamic option is true
        • It fails with a configuration error
        • To make matters worse, it fails after the directory listing, which takes a very long time
      • DistCp does not work well for a very large directory
        • We recommend splitting the job
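Splitting the job can be sketched as one DistCp invocation per top-level subdirectory, so each job lists a smaller tree and can run, retry, and fail independently. The paths, nameservice names, and map count below are illustrative assumptions:

```python
def split_distcp_jobs(subdirs, src_ns, dst_ns, max_maps=100):
    """Build one 'hadoop distcp' command line per subdirectory, instead of
    a single huge job whose serial directory listing dominates runtime."""
    return [
        f"hadoop distcp -update -m {max_maps} "
        f"hdfs://{src_ns}{d} hdfs://{dst_ns}{d}"
        for d in subdirs
    ]
```

Keeping each job's map count modest also sidesteps the failure above with more than 200 map tasks under the dynamic strategy.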
  20. DN traffic reached the NW bandwidth limit
      • We faced many job failures just after the upgrade
      • When splitting DNs, we considered only the data size, but that is not sufficient
        • Read/write requests must be considered as well
      (Graph: DN outbound traffic in a subcluster hitting the 25Gbps limit)
  21. DFSRouter slowdown
      • The DFSRouter slows down drastically when the active NameNode is restarted
      • We wrote a patch; fixed in HDFS-15555
      (Graph: DFSRouter average RPC queue time spiking to 30 sec between restarting the active NameNode and it finishing loading the fsimage)
  22. HttpFS incompatibilities
      • The implementation of the web server is different
        • Hadoop 2.x: Tomcat 6.x
        • Hadoop 3.x: Jetty 9.x
      • The behavior is very different
        • Jetty supports HTTP/1.1 (chunked encoding)
        • The default idle timeout is different
          • Tomcat: 60 seconds
          • Jetty: set by "hadoop.http.idle_timeout.ms" (default 1 second)
        • The response flow (at what timing the server returns 401) is different
        • The response body itself is different
        • and more...
      • You need to test very carefully if you are using HttpFS
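One concrete mitigation for the idle-timeout difference is to raise the Jetty timeout explicitly; a sketch, with the 60-second value chosen only to mimic the old Tomcat behavior:

```xml
<!-- Restore a Tomcat-like 60s idle timeout on the Jetty-based HttpFS -->
<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <value>60000</value>
</property>
```

The response-flow and response-body differences, however, have no config knob and can only be caught by testing clients against the new server.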
  23. Lessons learned
      • We changed many configurations at a time, but this should be avoided as much as possible
        • For example, we changed the block placement policy to be rack fault-tolerant, and under-replicated blocks reached 300M+ after the upgrade
        • Troubleshooting becomes more difficult
        • The HttpFS upgrade could also have been separated from this upgrade, as was done for ZooKeeper
      • Imagine what will happen in production and test it in advance as much as possible
        • Consider the differences between dev/staging and prod
        • There is a limit to what one person can imagine. Ask many colleagues!
  24. HDFS future works
      • Router-based Federation
        • Rebalance DNs/namespaces between subclusters well
        • Considering multiple subclusters, non-split DNs (or even a hybrid), HFR, and so on
      • Erasure Coding in production
        • Internally backporting the EC feature to the old HDFS client; the work is mostly finished
      • Try new low-pause-time GC algorithms
        • ZGC, Shenandoah
  25. We are hiring!
      https://about.yahoo.co.jp/hr/job-info/role/1247/ (Japanese)
