Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

  1. Network for the Large-scale Hadoop cluster at Yahoo! JAPAN. Kai Fukazawa, Yahoo Japan Corporation. 2016/10/27
  2. Agenda: Hadoop and Related Network; Yahoo! JAPAN’s Hadoop Network Transition; Network Related Problems and Solutions (Network Related Problems; Network Requirements of the Latest Cluster; Adopted IP CLOS Network for Solving Problems); Yahoo! JAPAN’s IP CLOS Network (Architecture; Performance Tests; New Problems); Future Plan
  3. Hadoop and Related Network
  4-5. Hadoop and Related Network. Hadoop has various communication events: Heartbeat; Reports (Job/Block/Resource); Block Data Transfer. References: “HDFS Architecture”. Apache Hadoop. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html (accessed 10/06/2016); “Google I/O 2011: App Engine MapReduce” (05/11/2011). https://www.youtube.com/watch?v=EIxelKcyCC0 (accessed 10/06/2016).
  6. Hadoop and Related Network. Hadoop has various communication events: Heartbeat; Reports (Job/Block/Resource); Block Data Transfer. Highlighted here: the North/South traffic (heartbeats and reports toward the masters).
  7. Hadoop and Related Network. Hadoop has various communication events: Heartbeat; Reports (Job/Block/Resource); Block Data Transfer. Highlighted here: the East/West traffic (block data transfer between DataNodes).
  8. Hadoop and Related Network. Hadoop has various communication events: Heartbeat; Reports (Job/Block/Resource); Block Data Transfer. The volumes differ: block data transfer is High, heartbeats and reports are Low.
  9. Hadoop and Related Network. Reference: “Introduction to Facebook’s data center fabric” (11/14/2014). https://www.youtube.com/watch?v=mLEawo6OzFM (accessed 10/06/2016).
  10. Hadoop and Related Network. Oversubscription: commonly expressed as the ratio of the bandwidth demanded to the bandwidth available. Example: 40 nodes with 1Gbps NICs demand 40Gbps behind a 10Gbps uplink, an oversubscription of 40:10 = 4:1 (computed in the sketch below). Source: Hadoop Operations by Eric Sammer (O’Reilly), copyright 2012 Eric Sammer, 978-1-449-32705-7.
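Since this ratio recurs for every cluster generation below, here is a minimal sketch of the arithmetic in Python; the 40-node example is the one from the slide, and the function name is mine.

```python
from fractions import Fraction

def oversubscription(nodes: int, nic_gbps: int, uplink_gbps: int) -> Fraction:
    """Ratio of the bandwidth servers can demand to the uplink bandwidth available."""
    return Fraction(nodes * nic_gbps, uplink_gbps)

# The example from the slide: 40 nodes x 1Gbps NIC behind a 10Gbps uplink.
ratio = oversubscription(nodes=40, nic_gbps=1, uplink_gbps=10)
print(f"{ratio.numerator}:{ratio.denominator}")  # -> 4:1
```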
  11. Yahoo! JAPAN’s Hadoop Network Transition
  12. Yahoo! JAPAN’s Hadoop Network Transition. [Chart: cluster volume in PB for Cluster1 (Jun. 2011), Cluster2 (Jan. 2013), Cluster3 (Apr. 2014), Cluster4 (Dec. 2015), and Cluster5 (Jun. 2016).]
  13-19. Yahoo! JAPAN’s Hadoop Network Transition. Cluster1: Stack Architecture (4 switches per stack, up to ~10 switches); 90 nodes/rack; 1Gbps server NIC; 20Gbps uplink; oversubscription 90Gbps : 20Gbps = 4.5:1.
  20-25. Yahoo! JAPAN’s Hadoop Network Transition. Cluster2: Spanning Tree Protocol (redundant uplinks blocking); 40 nodes/rack; 1Gbps server NIC; 10Gbps uplink; oversubscription 40Gbps : 10Gbps = 4:1.
  26-30. Yahoo! JAPAN’s Hadoop Network Transition. Cluster3: L2 Fabric/Channel; 40 nodes/rack; 1Gbps server NIC; 20Gbps uplink; oversubscription 40Gbps : 20Gbps = 2:1.
  31-33. Yahoo! JAPAN’s Hadoop Network Transition. Cluster4: L2 Fabric/Channel; 16 nodes/rack; 10Gbps server NIC; 80Gbps uplink; oversubscription 160Gbps : 80Gbps = 2:1.
  34. Yahoo! JAPAN’s Hadoop Network Transition (summary). Release / Volume / Nodes per Switch / NIC / Oversubscription: Cluster1: 3PB / 90 / 1Gbps / 4.5:1; Cluster2: 20PB / 40 / 1Gbps / 4:1; Cluster3: 38PB / 40 / 1Gbps / 2:1; Cluster4: 58PB / 16 / 10Gbps / 2:1; Cluster5: 75PB / ? / ?Gbps / ?:?.
  35. Network Related Problems and Solutions
  36. Network Related Problems: effect of switch failure in the Stack Architecture; load on the switch due to BUM traffic; limitations for the DataNode decommission; limitations for the scale-out.
  37. Network Related Problems: effect of switch failure in the Stack Architecture. One of the switches forming the stack failed, which affected the other switches in the same stack and interrupted communication among 90 nodes (5 racks), causing insufficient computing resources and processing stoppage.
  38. Network Related Problems: load on the switch due to BUM traffic. With about 4,400 nodes behind one L2 fabric, ARP traffic from the servers drives up the core switch CPU load. Tuning the ARP cache entry timeout helped, but the underlying problem is the large network address (one huge broadcast domain); a back-of-the-envelope estimate follows.
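To see why one flat L2 domain hurts, here is a rough sketch; the 4,400-node count is from the slide, while the peer count and timeout are labeled assumptions.

```python
# Rough ARP-pressure estimate for one flat L2 domain (illustrative assumptions).
nodes = 4400            # from the slide
peers_per_node = 1000   # assumption: distinct peers each DataNode talks to
arp_timeout_s = 60      # assumption: ARP cache entry timeout

# Each expired cache entry triggers a broadcast ARP request that every
# node, and the core switch CPU, must receive and process.
requests_per_s = nodes * peers_per_node / arp_timeout_s
print(f"~{requests_per_s:,.0f} broadcast ARP requests/s across the domain")

# Raising the timeout (the tuning above) divides this rate, but the real fix
# is shrinking the broadcast domain, which is where the deck is heading.
```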
  39. Network Related Problems: limitations for the DataNode decommission. The impact on running jobs has to be considered, which limits the number of nodes that can be decommissioned at once; a rough estimate of why is sketched below.
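As a hedged illustration of why decommissioning is network-bound: every block held by a decommissioning DataNode must be re-replicated from surviving replicas, so the time is roughly data volume over spare bandwidth. All numbers below are assumptions, not figures from the deck.

```python
def decommission_hours(nodes: int, tb_per_node: float, spare_gbps: float) -> float:
    """Rough time to re-replicate the data held by decommissioning DataNodes.

    spare_gbps: aggregate fabric bandwidth left over for re-replication.
    """
    bits_to_copy = nodes * tb_per_node * 8e12
    return bits_to_copy / (spare_gbps * 1e9) / 3600

# Assumption-laden example: 10 nodes x 40TB each, 20Gbps of spare capacity.
print(f"~{decommission_hours(10, 40, 20):.0f} hours")  # ~44 hours
```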
  40. Network Related Problems: limitations for the scale-out. The Stack Architecture scales only up to ~10 switches; the L2 Fabric architecture is limited by the number of chassis.
  41. Network Requirements of the Latest Cluster: 120-200 racks; scale-out possible up to 10,000 nodes; 100-200Gbps uplink per rack; 10Gbps server NICs; 20 nodes/rack; data center located in the US.
  42. How to solve these problems?
  43. How to solve these problems? We adopted an IP CLOS network!
  44. Adopted IP CLOS Network for Solving Problems. Google, Facebook, Amazon, Yahoo, and other over-the-top providers have adopted this data center network architecture. Reference: “Introducing data center fabric, the next-generation Facebook data center network”. Facebook Code. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/ (accessed 10/06/2016).
  45. Adopted IP CLOS Network for Solving Problems: improved scalability; improved high availability; copes with the increase in East/West traffic; reduced operating cost.
  46. Yahoo! JAPAN’s IP CLOS Network
  47. Architecture: box-switch Spine/Leaf/ToR design. No limitation on scale-out, but it requires many switches. [Diagram: Spine, Leaf, and ToR layers.]
  48-49. Architecture. [Diagrams: Internet, Core Router, Spine, and Leaf; Layer 3 runs from the Core Router through the Spine down to the Leaf, with Layer 2 only below the Leaf. Slide 49 zooms into the Spine and Leaf stages.]
  50. Architecture. Why was this architecture adopted? It reduces the number of items to be managed (IP addresses, cables, interfaces, BGP neighbors, ...); it overcomes physical constraints, such as the one-floor limit; and it reduces cost.
  51. Architecture. ECMP: the routing protocol between Spine and Leaf is BGP, with equal-cost multipath across the uplinks; a toy illustration follows. [Diagram: Internet, Core Router, Spine, Leaf.]
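ECMP spreads flows across the parallel uplinks by hashing the flow 5-tuple, so all packets of one TCP connection stay on one path. A minimal sketch of the idea (real switches use vendor-specific hash functions; this one is purely illustrative):

```python
import hashlib

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: str = "tcp", n_paths: int = 4) -> int:
    """Pick one of n_paths uplinks from a hash of the flow 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % n_paths

# Two flows between the same pair of DataNodes can land on different uplinks:
print(ecmp_path("10.0.1.10", "10.0.2.20", 50010, 35001))
print(ecmp_path("10.0.1.10", "10.0.2.20", 50010, 35002))
```

This per-flow stickiness also explains the “new problem” later in the deck: a single errored uplink slows only the transfers hashed onto it.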
  52. Architecture. Addressing: /31 for each Spine-Leaf point-to-point link; /26 or /27 per rack (carved as in the sketch below). [Diagram.]
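A sketch of how such a plan can be carved with Python's standard ipaddress module; only the /31 and /26 prefix lengths come from the slide, the supernets are made up.

```python
import ipaddress

p2p_pool  = ipaddress.ip_network("10.255.0.0/24")  # hypothetical link pool
rack_pool = ipaddress.ip_network("10.0.0.0/16")    # hypothetical rack pool

links = list(p2p_pool.subnets(new_prefix=31))   # 128 x /31 Spine-Leaf links
racks = list(rack_pool.subnets(new_prefix=26))  # 1024 x /26 rack networks

print(links[0])                    # 10.255.0.0/31
print(racks[0])                    # 10.0.0.0/26
print(racks[0].num_addresses - 2)  # 62 usable host addresses per rack
```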
  53. Architecture. This addressing resolved the “BUM traffic problem”: each rack is its own small L3 network, so broadcast traffic no longer crosses the fabric.
  54-58. Architecture. Leaf uplink: 40Gbps x 4 = 160Gbps per rack. With 20 nodes per rack on 10Gbps NICs, oversubscription is 200:160 = 1.25:1. This resolved the “limitations for the DataNode decommission” and, with four independent uplinks per Leaf, improved high availability.
  59-60. Architecture: revisiting the problem list (effect of switch failure in the Stack Architecture; load on the switch due to BUM traffic; limitations for the DataNode decommission; limitations for the scale-out), most of which are now checked off (✔ ✔ ✔).
  61. Yahoo! JAPAN’s Hadoop Network Transition (summary). Release / Volume / Nodes per Switch / NIC / Oversubscription: Cluster1: 3PB / 90 / 1Gbps / 4.5:1; Cluster2: 20PB / 40 / 1Gbps / 4:1; Cluster3: 38PB / 40 / 1Gbps / 2:1; Cluster4: 58PB / 16 / 10Gbps / 2:1; Cluster5: 75PB / 20 / 10Gbps / 1.25:1.
  62. Performance Tests (5TB Terasort)
  63-65. Performance Tests (40TB DistCp): 16 nodes/rack at about 8Gbps/node; uplink utilization about 30Gbps x 4 = 120Gbps.
  66-68. New Problems: delay in data transfer. One of the four uplinks was generating error packets, and that single link delayed block transfers; the DataNode logs showed: “org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror” (a log-counting sketch follows).
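The “Slow BlockReceiver” line quoted above is the real DataNode log message; here is a minimal sketch for counting those warnings per hour to localize a bad path (the log path and timestamp prefix are assumptions about the log layout).

```python
import re
from collections import Counter

SLOW = re.compile(r"Slow BlockReceiver write packet to mirror")

def slow_events_per_hour(log_path: str) -> Counter:
    """Count 'Slow BlockReceiver' warnings, bucketed by 'YYYY-MM-DD HH'."""
    hours = Counter()
    with open(log_path) as f:
        for line in f:
            if SLOW.search(line):
                hours[line[:13]] += 1  # assumes a 'YYYY-MM-DD HH:MM:SS' prefix
    return hours

for hour, count in sorted(slow_events_per_hour("datanode.log").items()):
    print(hour, count)
```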
  69. New Problems: the IP address changes when a server moves to another rack, because each rack has its own network address (e.g., 192.168.0.10 in 192.168.0.0/26 becomes 192.168.0.100 in 192.168.0.64/26). Since access control uses IP addresses, every relocation requires an ACL update; one mitigation is sketched below.
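One way to soften the ACL churn is to grant access per rack subnet instead of per host IP, regenerating rules from a subnet inventory; this only works where a whole rack shares one access profile. A hypothetical sketch (the two subnets are from the slide; the rule format and service mapping are invented):

```python
import ipaddress

# Hypothetical service -> rack-subnet inventory; subnets are from the slide.
acl_sources = {
    "hdfs-clients": ["192.168.0.0/26", "192.168.0.64/26"],
}

def render_rules(service: str) -> list[str]:
    """Emit one permit rule per source subnet instead of per host IP."""
    return [f"permit {service} from {ipaddress.ip_network(net)}"
            for net in acl_sources[service]]

for rule in render_rules("hdfs-clients"):
    print(rule)
```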
  70. Future Plan
  71. Future Plan: detect error-packet failures before they affect data transfer.
  72. Future Plan: on detecting such errors, automatically shut down the failing link (Auto Shutdown); a sketch of the idea follows.
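A hedged sketch of the detect-and-shut-down idea using Linux interface error counters; the sysfs path is standard on Linux, but the threshold and shutdown hook are assumptions, and a switch-side version would use the vendor's API instead.

```python
import time
from pathlib import Path

def rx_errors(iface: str) -> int:
    """Read the kernel's cumulative receive-error counter for an interface."""
    return int(Path(f"/sys/class/net/{iface}/statistics/rx_errors").read_text())

def watch(iface: str, threshold_per_min: int = 100) -> None:
    """Shut the link down before error packets start delaying block transfers."""
    last = rx_errors(iface)
    while True:
        time.sleep(60)
        now = rx_errors(iface)
        if now - last > threshold_per_min:
            print(f"{iface}: {now - last} rx errors/min, shutting down link")
            break  # here: call the real shutdown hook (drain BGP, ifdown, alert)
        last = now

# watch("eth0")  # example invocation; the interface name is an assumption
```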
  73-79. Future Plan: use Erasure Coding. The original raw data is striped into 64kB cells across six data blocks (D1-D6), three parity blocks (P1-P3) are computed, and the nine blocks are spread across different DataNodes, as in the simplified sketch below. A read must gather cells from many nodes, so data locality is low.
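A simplified sketch of the striping shown above. HDFS erasure coding uses Reed-Solomon RS(6,3); to stay short, this stand-in computes a single XOR parity, so only the layout (64kB cells round-robined across D1-D6) matches the slides.

```python
from functools import reduce

CELL = 64 * 1024  # 64kB striping cell, as on the slide

def stripe(data: bytes, n_data: int = 6) -> list[bytearray]:
    """Round-robin 64kB cells across n_data blocks (D1..D6 on the slide)."""
    blocks = [bytearray() for _ in range(n_data)]
    for i in range(0, len(data), CELL):
        blocks[(i // CELL) % n_data] += data[i:i + CELL]
    return blocks

def xor_parity(blocks: list[bytearray]) -> bytes:
    """Stand-in for Reed-Solomon: one XOR parity over length-padded blocks."""
    size = max(len(b) for b in blocks)
    padded = [bytes(b).ljust(size, b"\0") for b in blocks]
    return bytes(reduce(lambda a, b: a ^ b, cells) for cells in zip(*padded))

blocks = stripe(bytes(range(256)) * 4096)  # ~1MB of sample data
parity = xor_parity(blocks)
print([len(b) for b in blocks], len(parity))
```

Because a reader must gather cells from six different DataNodes, locality drops, which is the “low data locality” caveat above.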
  80. Future Plan: interconnecting various platforms; the links between them become the bottleneck. [Diagram.]
  81. Future Plan: isolation of computing and storage (dedicated storage machines and computing machines). [Diagram.]
  82. Thank You for Listening!
  83. Appendix
  84. Appendix: JANOG38, http://www.janog.gr.jp/meeting/janog38/program/clos
