
Fusion of High-Performance Computing and AI/Big Data Processing on the AI Bridging Cloud Infrastructure (ABCI)


Slides for the 2nd HPC OPS Meeting
https://bit.riken.jp/2018/06/2nd-hpc-ops-mtg/



  1. Title slide: Fusion of High-Performance Computing and AI/Big Data Processing on the AI Bridging Cloud Infrastructure (ABCI)
  2. (slide text not recoverable)
  3. (slide text not recoverable)
  4. AI Bridging Cloud Infrastructure as World's First Large-scale Open AI Infrastructure
     • Open, public, and dedicated infrastructure for AI/Big Data
     • Platform to accelerate joint academic-industry R&D for AI in Japan
     • Top-level compute capability w/ 0.550 EFlops(AI), 37.2 PFlops(DP)
     • Univ. Tokyo Kashiwa II Campus; operation scheduled in 2018
  5. • 1088 compute nodes w/ 4352 NVIDIA Tesla V100 GPUs, 43,520 CPU cores, 476 TiB of memory, 1.6 PB of NVMe SSDs, 22 PB of HDD-based storage, and an InfiniBand EDR network
     • Ultra-dense IDC design from the ground up w/ 20x the thermal density of a standard IDC
     • Extreme green w/ ambient warm liquid cooling, high-efficiency power supplies, etc., commoditizing supercomputer cooling technologies to clouds (2.3 MW, 70 kW/rack)
     • Computing nodes: 0.550 EFlops(HP), 37 PFlops(DP), 476 TiB memory, 1.6 PB NVMe SSD; storage: 22 PB GPFS; gateway and firewall in front
     • High-performance computing nodes (w/ GPU) x1088: Intel Xeon Gold 6148 (2.4 GHz / 20 cores) x2, NVIDIA Tesla V100 (SXM2) x4, 384 GiB memory, 1.6 TB NVMe SSD
     • Multi-platform nodes (w/o GPU) x10: Intel Xeon Gold 6132 (2.6 GHz / 14 cores) x2, 768 GiB memory, 3.8 TB NVMe SSD
     • Storage: DDN SFA14K (w/ SS8462 enclosure x10) x3, 12 TB 7.2 Krpm NL-SAS HDD x2400, 3.84 TB SAS SSD x216, NSD servers x12; object storage / protocol nodes
     • Interactive nodes; service network (10 GbE / 100 GbE); external networks via SINET5; interconnect: InfiniBand EDR
  6. ABCI: AI Bridging Cloud Infrastructure — 0.550 EFlops(AI), 37.2 PFlops(DP); 19.88 PFlops (Peak), ranked #5 on the Top500, June 2018 (see the roll-up arithmetic below)
     • Chips: NVIDIA Tesla V100 (16GB SXM2) GPU at 7.8 TFlops(DP) / 125 TFlops(AI); Intel Xeon Gold 6148 (27.5M cache, 2.40 GHz, 20 cores) CPU at 1.53 TFlops(DP) / 3.07 TFlops(AI)
     • Compute node (4 GPUs, 2 CPUs): 34.2 TFlops(DP), 506 TFlops(AI), 384 GiB memory, 3.72 TB/s memory BW, 1.6 TB NVMe SSD, 200 Gbps network BW
     • Node chassis (2 compute nodes): 68.5 TFlops(DP), 1.01 PFlops(AI)
     • Rack (17 chassis): 1.16 PFlops(DP), 17.2 PFlops(AI), 131 TB/s memory BW, full bisection BW within a rack, 70 kW max
     • System (32 racks): 1088 compute nodes, 4352 GPUs, 37.2 PFlops(DP), 0.550 EFlops(AI), 4.19 PB/s memory BW, 1/3 oversubscription BW between racks, 2.3 MW
  7. GPU compute nodes (see the topology-inspection sketch below)
     • NVIDIA Tesla V100 (16GB, SXM2) x4
     • Intel Xeon Gold 6148 x2 sockets, 20 cores per socket
     • 384 GiB of DDR4 memory (DDR4-2666 32GB x6 per socket, 128 GB/s per socket)
     • 1.6 TB NVMe SSD x1 (Intel DC P4600, U.2)
     • EDR InfiniBand HCA (100 Gbps) x2, connected to the other compute nodes and the filesystems
     • Block diagram: two Xeon Gold 6148 linked by UPI (10.4 GT/s x3); PCIe gen3 x16 links through x48/x64 PCIe switches to the IB HCAs, the NVMe SSD, and the four Tesla V100 SXM2 GPUs; NVLink2 x2 between GPUs
  8. Rack as a dense-packaged "pod" (x32 pods)
     • Pod #1: CX400 chassis #1-#17, each holding two CX2570 compute nodes (CX2570 #1-#34)
     • LEAF#1-#4 (SB7890) and FBB#1-#3 (SB7890) switches inside the pod; SPINE#1-#2 (CS7500) above the pods
     • Full bisection BW within the pod: IB-EDR x72; 1/3 oversubscription BW between pods: IB-EDR x24
     • InfiniBand EDR link multiplicities in the figure: x1, x4, x6
  9. Hierarchical storage tiers (see the BeeOND sketch below)
     • Local storage: 1.6 TB NVMe SSD (Intel DC P4600, U.2) per node, used as burst buffers; local storage aggregation w/ BeeOND
     • Parallel filesystem: 22 PB of GPFS on DDN SFA14K (w/ SS8462 enclosure x10) x3 sets; bare-metal NSD servers and flash-based metadata volumes to accelerate metadata operations; home and shared use
     • Object storage: part of GPFS exposed through OpenStack Swift as campaign storage; S3-like API access, global shared use; additional secure volumes w/ encryption (planned)
  10. Performance reference for distributed deep learning (see the launch sketch below)
     • Environment: ABCI, 64 nodes (256 GPUs); framework: ChainerMN v1.3.0 (Chainer 4.2.0, CuPy 4.2.3, mpi4py 3.0.0, Python 3.6.5); bare metal: CentOS 7.4, gcc 4.8.5, CUDA 9.2, cuDNN 7.1.4, NCCL 2.2, OpenMPI 2.1.3
     • Dataset: ImageNet-1K; model: ResNet-50
     • Training: batch size 32 per GPU (32 x 256 in total); learning rate starting at 0.1 and multiplied by 0.1 at epochs 30, 60, and 80, w/ warm-up scheduling; optimizer: momentum SGD (momentum=0.9); weight decay: 0.0001; 100 training epochs
     • (Chart: higher is better)
  11. Job execution workflow (partially garbled in the original text): users log in over SSH, place a script file under /home (GPFS), and submit it with "$ qsub <option> script_filename"; the batch scheduler (NQS in the figure) dispatches jobs to compute nodes over the interconnect, targeting high-throughput computing (see the job-script sketch below)
  12. (slide text not recoverable)
  13. Pre-built software versions (cont'd) (module loading is shown in the job-script sketch below)
     • CUDA: 8.0 (8.0.44, 8.0.61.2), 9.0 (9.0.176), 9.1 (9.1.85, 9.1.85.1, 9.1.85.3), 9.2 (9.2.88.1)
     • cuDNN: 5.1 (5.1.5, 5.1.10), 6.0 (6.0.21), 7.0 (7.0.5), 7.1 (7.1.1, 7.1.2, 7.1.3)
     • NCCL: 1.3 (1.3.4, 1.3.40-1), 2.0 (2.0.5-3), 2.1 (2.1.4-1, 2.1.15-1), 2.2 (2.2.12)
     • MPI: OpenMPI 2.1.3, 3.0.1, 3.1.0; MVAPICH2-GDR 2.3a
     • Python: 2.7, 3.5, 3.6; Python modules: mpi4py, matplotlib, Cython, Pillow, Jupyter
     • DL frameworks: Caffe2, CNTK, ChainerMN, TensorFlow, MXNet, NNabla
  14. Software stack for ABCI
     • Batch job scheduler: high-throughput computing (Univa Grid Engine)
     • Minimum pre-installed software: users can deploy their own environments using anaconda, pip, python venv, etc., reducing operational cost (see the venv sketch below)
     • Container support: Singularity for multi-node jobs w/ user-customized images; Docker for single-node jobs w/ site-certified images
     • Stack diagram: CentOS/RHEL; Univa Grid Engine, Singularity, Docker; GPFS, BeeOND, OpenStack Swift; CUDA/cuDNN/NCCL; GCC, PGI, OpenMPI, MVAPICH2, Intel Parallel Studio XE Cluster Edition; Python, Ruby, R, Java, Scala, Perl, Lua, etc.; Hadoop/Spark, DL frameworks, OSS, ISV apps, user applications
  15. ABCI: dynamic container deployment with HPC Linux containers (Singularity, Docker) (see the container-workflow sketch below)
     • Users register/copy container images from a container image repository (Docker Hub, private registry) to GPFS/object storage
     • Jobs are submitted to the job scheduler together with container images
     • Compute nodes import/copy the container images and run the jobs inside them
  16. (comparison of container runtimes for HPC vs. enterprise use, incl. CharlieCloud; remaining slide text not recoverable)
  17. (slide text not recoverable)
  18. Building and using Singularity images (see the definition-file sketch below)
     • sudo singularity build --sandbox tmpdir/ Singularity
     • sudo singularity build --writable container.img Singularity
     • sudo singularity build container.img Singularity
     • sudo singularity build container.img docker://ubuntu
     • sudo singularity build container.img shub://ubuntu
     • sudo singularity shell --writable container.img (to modify a writable image)
     • singularity run container.img / singularity exec container.img ... / singularity shell container.img (to use an image)
  19. (slide text not recoverable)
  20. (slide text not recoverable)
  21. Running GPU applications in Singularity containers with the --nv option (remaining slide text not recoverable; see the sketch below)
  22. (slide text not recoverable)
  23. (benchmark settings and results; chart where higher is better; remaining slide text not recoverable)
  24. Distributed deep learning frameworks in containers (see the bind-mount sketch below)
     • Base drivers/libraries on the host: CUDA drivers, InfiniBand drivers, filesystem libraries (GPFS, Lustre)
     • Userland libraries in the container: CUDA, cuDNN, NCCL2, MPI (mpi4py)
     • Host and container are connected via mounts and ibverbs
     • Distributed deep learning frameworks on top: Caffe2, ChainerMN, distributed TensorFlow, MXNet
  25. (image-only slide; no recoverable text)
  26. (slide text not recoverable)
  27. (slide text not recoverable)
  28. (slide text not recoverable)
  29. (image-only slide; no recoverable text)
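
A few sketches follow, expanding on the slides above. First, the per-level peak numbers on slide 6 compose consistently from the per-chip figures; as a quick check using the slide's own values:

    \begin{aligned}
    \text{node (DP)}   &= 4 \times 7.8 + 2 \times 1.53 \approx 34.3\ \text{TFlops}\\
    \text{node (AI)}   &= 4 \times 125 + 2 \times 3.07 \approx 506\ \text{TFlops}\\
    \text{system (DP)} &= 1088 \times 34.2\ \text{TFlops} \approx 37.2\ \text{PFlops}\\
    \text{system (AI)} &= 1088 \times 506\ \text{TFlops} \approx 0.550\ \text{EFlops}
    \end{aligned}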
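
On a node laid out as on slide 7, the NVLink/PCIe layout can be checked with the standard NVIDIA driver tool (available wherever the driver is installed):

    # Connection matrix between GPUs and HCAs: NVLink vs. PCIe hops, CPU affinity
    nvidia-smi topo -m

    # Per-GPU model, memory, utilization, and driver version
    nvidia-smi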
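
Slide 9's local storage aggregation with BeeOND pools the per-node 1.6 TB NVMe SSDs of a job into a temporary BeeGFS instance used as a burst buffer. A minimal sketch, assuming a hostfile listing the job's nodes and illustrative paths (the actual ABCI integration may wrap these steps):

    # Create an on-demand BeeGFS across the job's nodes, backed by each node's NVMe SSD
    beeond start -n hostfile.txt -d /local/beeond-data -c /mnt/beeond

    # ... stage data into /mnt/beeond and run the job against it ...

    # Tear the instance down when the job ends (cleanup flags vary by BeeOND version)
    beeond stop -n hostfile.txt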
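
The slide-10 measurement (ChainerMN v1.3.0 on 64 nodes, 256 GPUs) maps one MPI rank to each GPU. A minimal launch sketch; train_imagenet.py and its options follow the ChainerMN ImageNet example and are assumptions about the actual benchmark script:

    # 64 nodes x 4 GPUs = 256 ranks, one per GPU
    # Per-GPU batch size 32, i.e. 32 x 256 = 8192 images per iteration in total
    mpirun -np 256 -npernode 4 \
        python train_imagenet.py --arch resnet50 --batchsize 32 --epoch 100 \
            --communicator pure_nccl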
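
Slides 11, 13, and 14 together describe the qsub-based workflow: write a script, load one of the pre-built library combinations, and submit. A minimal sketch in Univa Grid Engine style; the resource request and the module names/versions are illustrative assumptions, not ABCI's actual option names:

    #!/bin/bash
    # Run from the submission directory; request a wall-clock limit.
    # (Resource names and values are site-specific assumptions.)
    #$ -cwd
    #$ -l h_rt=1:00:00

    # Pick one of the pre-built CUDA/cuDNN/NCCL/MPI combinations from slide 13
    # (module names below are illustrative)
    source /etc/profile.d/modules.sh
    module load cuda/9.2 cudnn/7.1 nccl/2.2 openmpi/2.1.3

    python train.py

The script is then submitted as shown on slide 11: $ qsub <option> script_filename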
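
Since only a minimum of software is pre-installed (slide 14), users assemble their own Python stacks, e.g. with python venv and pip; the package list below is illustrative:

    # One-time setup under the home directory (GPFS)
    python3 -m venv ~/venv/dl
    source ~/venv/dl/bin/activate
    pip install --upgrade pip
    pip install cupy chainer chainermn mpi4py

    # In a job script, re-activate the same environment before running
    source ~/venv/dl/bin/activate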
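
The slide-15 workflow in commands, as a sketch: the user builds or imports an image where they have root (cf. slide 18), copies it to ABCI's shared storage, and submits a job that runs inside it. Host name, paths, and the job-script name are assumptions:

    # On the user's own machine (root required for build, as on slide 18)
    sudo singularity build container.img docker://ubuntu

    # Register/copy the image to the shared filesystem on ABCI
    scp container.img abci:/home/username/

    # Submit a job whose script runs the workload inside the image
    qsub run_in_container.sh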
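
The "Singularity" argument in the slide-18 build commands is a definition (recipe) file. A minimal sketch of one, written from the shell for self-containment; the base image matches the slide's docker://ubuntu example and the installed packages are illustrative:

    # Write a minimal definition file named "Singularity"
    cat > Singularity <<'EOF'
    Bootstrap: docker
    From: ubuntu:16.04

    %post
        apt-get update && apt-get install -y python3 python3-pip

    %runscript
        exec python3 "$@"
    EOF

    # Build it into an image, as on slide 18
    sudo singularity build container.img Singularity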
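
Slide 21's --nv option exposes the host's NVIDIA driver and GPU devices inside the container, so the image only needs the CUDA userland. A minimal sketch (train.py is illustrative):

    # Check that the GPUs are visible from inside the container
    singularity exec --nv container.img nvidia-smi

    # Run a GPU workload inside the container
    singularity exec --nv container.img python train.py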
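
Slide 24's split keeps the CUDA and InfiniBand drivers on the host while the container carries the userland stack (CUDA, cuDNN, NCCL2, MPI). A common way to realize this, sketched below, is to launch MPI on the host and start one container per rank, with --nv for the GPU driver and a bind mount for the InfiniBand provider configuration; the bind path, rank counts, and script name are assumptions:

    # Host-side MPI launches one container instance per rank
    mpirun -np 8 -npernode 4 \
        singularity exec --nv -B /etc/libibverbs.d container.img \
            python train_imagenet.py --communicator pure_nccl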
