Introduction to National Supercomputer Center in Tianjin TH-1A Supercomputer


  1. Introduction to National Supercomputer Center in Tianjin TH-1A Supercomputer
  2. Agenda
      • National Supercomputer Center in Tianjin (NSCC-TJ)
      • TH-1A system
        • Hardware sub-system
        • Software sub-system
      • Applications
  3. NSCC-TJ
      • National SuperComputer Center in Tianjin
        • Sponsored by
          • Chinese Ministry of Science and Technology
          • Tianjin Binhai New Area
        • Public information infrastructure
          • To accelerate the economy, education and industry of Northern China
          • To provide high-performance computing services to all of China
        • Open platform for research and education
  4. NSCC-TJ (photo labels: Main building; Office; Computer room; Transformer station & air conditioner). Total area: 2400 m²
  5. NSCC-TJ: first floor of the central computing room, 1200 m²
  6. NSCC-TJ: second floor of the central computing room, visualization environment, 1200 m²
  7. NSCC-TJ: electric transformer station
  8. NSCC-TJ: cooling water station (slide footer: 2011-6-28, TH-1)
  9. NSCC-TJ
      • Layout of computing room
  10. TH-1A system
  11. TH-1A system
      • Enhanced system based on the TH-1 system (Sep. 2009)
      • Installed in NSCC-TJ, Aug. 2010
      • Debugging and performance testing, Sept. to Oct. 2010
      • In service since Nov. 2010

      Items          Configuration
      Processors     14336 Intel CPUs + 7168 NVIDIA GPUs + 2048 FT CPUs
      Memory         262 TB in total
      Interconnect   Proprietary high-speed interconnecting network
      Storage        2 PB
      Cabinets       120 compute/service cabinets
                     14 storage cabinets
                     6 communication cabinets
  12. TH-1A system
      • TH-1A system architecture
        • Hybrid MPP structure: CPU & GPU
        • Proprietary compute nodes
        • Connected by proprietary high-speed interconnect network
        • Global shared parallel storage system
        • Custom software stack
  13. TH-1A hardware sub-system (block diagram; labels: Compute sub-system with CPU + GPU nodes; Service sub-system with operation nodes; Monitor and diagnosis sub-system; Communication sub-system; Storage sub-system with MDS and OSS servers)
  14. Compute sub-system
      • 7,168 compute nodes
        • 2 six-core CPUs and 1 GPU per node
        • CPU
          • Xeon X5670 (Westmere)
          • Processor speed: 2.93 GHz
        • GPU
          • NVIDIA Tesla M2050
          • Connected with CPU by PCI-E
        • 32 GB memory per node
        • 2U height
        • Peak performance
          • 4,701,061 Gflops
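
The quoted peak can be reproduced from per-device peaks. The following is a back-of-the-envelope sketch, not a calculation shown on the slides. The node count, CPU count and CPU clock come from the slide; the 4 double-precision flops per cycle per Westmere core, and the M2050 figures (448 CUDA cores at 1.15 GHz, one double-precision flop per core per cycle) are assumptions taken from vendor specifications.

```python
# Figures from the slide.
NODES = 7168
CPUS_PER_NODE = 2
CORES_PER_CPU = 6
CPU_GHZ = 2.93

# Assumptions (not on the slide), based on vendor specifications.
CPU_DP_FLOPS_PER_CYCLE = 4   # SSE: 2-wide add + 2-wide multiply per core
GPU_CORES = 448              # Tesla M2050 CUDA cores
GPU_GHZ = 1.15               # Tesla M2050 shader clock
GPU_DP_FLOPS_PER_CYCLE = 1   # FMA counts as 2 flops, DP runs at half rate


def cpu_peak_gflops():
    """Double-precision peak of one Xeon X5670 in Gflops."""
    return CORES_PER_CPU * CPU_GHZ * CPU_DP_FLOPS_PER_CYCLE


def gpu_peak_gflops():
    """Double-precision peak of one Tesla M2050 in Gflops."""
    return GPU_CORES * GPU_GHZ * GPU_DP_FLOPS_PER_CYCLE


def compute_subsystem_peak_gflops():
    """Peak of all compute nodes (2 CPUs + 1 GPU each) in Gflops."""
    per_node = CPUS_PER_NODE * cpu_peak_gflops() + gpu_peak_gflops()
    return NODES * per_node


if __name__ == "__main__":
    print(f"CPU peak:    {cpu_peak_gflops():.2f} Gflops")
    print(f"GPU peak:    {gpu_peak_gflops():.2f} Gflops")
    print(f"System peak: {compute_subsystem_peak_gflops():,.0f} Gflops")
```

Under these assumptions the GPUs contribute roughly 79% of the peak, which is why the hybrid CPU & GPU structure dominates the system design.
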
  15. Service sub-system
      • 1,024 service nodes
        • 2 eight-core domestic CPUs
        • CPU: FT-1000
          • SoC
          • 1.0 GHz
          • Eight cores, eight threads per core
          • Peak performance: 8 Gflops
        • 32 GB memory per node
        • For login, compilation, and applications that need throughput computing
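
The per-node figures on the compute and service slides can be cross-checked against the system totals in the configuration table (14336 Intel CPUs, 2048 FT CPUs, 262 TB memory). The sketch below does only that arithmetic. The one assumption not on the slides is that each FT-1000 core retires one floating-point operation per cycle, which is what the quoted 8 Gflops at 1.0 GHz with eight cores implies.

```python
# Figures from the slides.
COMPUTE_NODES = 7168
SERVICE_NODES = 1024
CPUS_PER_NODE = 2        # both node types carry two CPUs
MEMORY_GB_PER_NODE = 32  # both node types carry 32 GB
FT_CORES = 8
FT_GHZ = 1.0

# Assumption (not on the slides): one flop per core per cycle.
FT_FLOPS_PER_CYCLE = 1


def intel_cpu_count():
    return COMPUTE_NODES * CPUS_PER_NODE


def ft_cpu_count():
    return SERVICE_NODES * CPUS_PER_NODE


def total_memory_gb():
    return (COMPUTE_NODES + SERVICE_NODES) * MEMORY_GB_PER_NODE


def ft_peak_gflops():
    return FT_CORES * FT_GHZ * FT_FLOPS_PER_CYCLE


if __name__ == "__main__":
    print("Intel CPUs:", intel_cpu_count())
    print("FT CPUs:", ft_cpu_count())
    print("Memory (GB):", total_memory_gb())
    print("FT-1000 peak (Gflops):", ft_peak_gflops())
```

The memory total comes out at 262,144 GB, which matches the "262 TB in total" entry when 1 TB is counted as 1000 GB.
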
  16. Proprietary interconnection network
      • Interconnection signal speed: 10 Gbps
      • Bi-directional bandwidth: 160 Gbps
      • Hierarchical fat-tree structure
        • First stage: 16 nodes connected by a 16-port switching board
        • Second stage: all parts connected to eleven 384-port switches
  17. Proprietary interconnection network
      • High-radix router ASIC: NRC
        • Feature size: 90 nm
        • Die size: 17.16 mm x 17.16 mm
        • Package: FC-PBGA
        • 2577 pins
        • Throughput of single NRC: 2.56 Tbps
      • Network interface ASIC: NIC
        • Same feature size and package as NRC
        • Die size: 10.76 mm x 10.76 mm
        • 675 pins
  18. Proprietary interconnection network (photo captions)
      • 16-port switch board in cabinet
      • Leaf switch blade and root switch blade of the 384-port switch
      • Back plane of the 384-port switch, about 700 mm x 600 mm
  19. Proprietary interconnecting network
      • Switching board and high-radix switch
        • Based on network interface ASIC and router ASIC
      • Reduced user communication protocol
      • Throughput: 61.44 Tbps
      (photo: front and back views of two 384-port high-radix switches)
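
The throughput figures on the interconnect slides are consistent with multiplying the port count by the 160 Gbps bi-directional port bandwidth. The slides do not state how the numbers were derived, so the sketch below is a plausibility check only. The 16-port count for the NRC router ASIC is an assumption inferred from the 16-port switching board, and the lane count is derived rather than quoted.

```python
# Figures quoted on the slides.
LANE_GBPS = 10          # interconnection signal speed
PORT_BIDIR_GBPS = 160   # bi-directional bandwidth per port
SWITCH_PORTS = 384      # ports on one high-radix switch

# Assumption (not stated on the slides): the NRC router ASIC has
# 16 ports, inferred from the 16-port switching board.
NRC_PORTS = 16


def lanes_per_direction(port_bidir_gbps, lane_gbps):
    """Number of serial lanes needed in each direction of a port."""
    return port_bidir_gbps // (2 * lane_gbps)


def aggregate_tbps(ports, port_bidir_gbps):
    """Aggregate bi-directional throughput of a switch in Tbps."""
    return ports * port_bidir_gbps / 1000


if __name__ == "__main__":
    print("Lanes per direction:", lanes_per_direction(PORT_BIDIR_GBPS, LANE_GBPS))
    print("NRC throughput (Tbps):", aggregate_tbps(NRC_PORTS, PORT_BIDIR_GBPS))
    print("384-port switch (Tbps):", aggregate_tbps(SWITCH_PORTS, PORT_BIDIR_GBPS))
```

With these inputs the sketch yields 8 lanes per direction, 2.56 Tbps for a single NRC and 61.44 Tbps for a 384-port switch, matching the quoted values.
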
  20. Storage sub-system
      • Capacity: 2 PB
      • Connected by proprietary interconnection network
      • Lustre-based parallel file system
  21. Monitor and diagnosis sub-system
      • Rich monitor & control functions
        • Real-time monitoring of hardware parameters
        • Precise fault location
        • Alarm and immediate action in emergencies
        • Self-feedback cooling adjustment based on environment status
      • I2C & JTAG diagnosis mechanism
      • Large-scale console
      • Remote monitoring and management
  22. Computing cabinet
      • Node: 2 CPUs and 1 GPU
      • Blade: 2 nodes
      • Frame
        • 8 computing blades
        • 16-port switching board
        • 1 monitor and diagnosis board
      • Cabinet
        • 4 frames, 64 nodes
      • Close-coupled chilled water cooling
        • 128 CPUs, 64 GPUs
        • 56 kW cooling capacity in a cabinet
      • Footprint
        • 700 m²
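
The packaging hierarchy on this slide multiplies out to the per-cabinet totals it quotes. The sketch below walks that arithmetic and also estimates how many cabinets the 7,168 compute nodes would fill; that last figure is a derived estimate, not a number from the slides.

```python
# Figures from the slides.
CPUS_PER_NODE = 2
GPUS_PER_NODE = 1
NODES_PER_BLADE = 2
BLADES_PER_FRAME = 8
FRAMES_PER_CABINET = 4
COMPUTE_NODES = 7168


def nodes_per_frame():
    return NODES_PER_BLADE * BLADES_PER_FRAME


def nodes_per_cabinet():
    return nodes_per_frame() * FRAMES_PER_CABINET


def cabinet_totals():
    """Return (nodes, cpus, gpus) housed in one computing cabinet."""
    nodes = nodes_per_cabinet()
    return nodes, nodes * CPUS_PER_NODE, nodes * GPUS_PER_NODE


def compute_cabinets_needed():
    """Cabinets needed for all compute nodes, rounded up (derived estimate)."""
    per_cabinet = nodes_per_cabinet()
    return (COMPUTE_NODES + per_cabinet - 1) // per_cabinet


if __name__ == "__main__":
    print("Nodes per frame:", nodes_per_frame())
    print("Per cabinet (nodes, CPUs, GPUs):", cabinet_totals())
    print("Compute cabinets needed:", compute_cabinets_needed())
```

Each frame holds 16 nodes, which lines up with the 16-port switching board in the frame. If this packing holds throughout, the compute nodes fill 112 cabinets, leaving 8 of the 120 compute/service cabinets in the configuration table for the service nodes.
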
  23. TH-1A software sub-system
      • Software stack
  24. Operating system
      • Kylin Linux
      • Compute node kernel
      • Provides virtual running environments
        • Isolated running environments for different users
        • Custom software package installation
      • QoS support
      • Power-aware computing
  25. Compiler system
      • C, C++, Fortran, Java
      • OpenMP, MPI, OpenMP/MPI
      • CUDA, OpenCL
      • Heterogeneous programming framework
        • Accelerates large-scale, complex applications, especially applications still under development or whose full source code is not available
        • Uses the computing power of CPUs and GPUs while hiding GPU programming from users
        • Inter-node homogeneous parallel programming (users)
        • Intra-node heterogeneous parallel computing (computer experts)
  26. Compiler system
      • Heterogeneous programming framework
        • Inter-node homogeneous parallel programming (JASMIN)
          • Patch-based object data structures
          • MPI communication, dynamic load balancing support
          • Zero-copy optimization in communication library
  27. Compiler system
      • Heterogeneous programming framework
        • Intra-node heterogeneous parallel computing
          • Compiler-optimized / hand-tuned threaded code
          • Optimizations include
            • Adaptive partitioning: balance the workloads between CPUs and GPU
            • Asynchronous data transfer / computing: overlap CPU operations with GPU operations
            • Software pipelining: overlap GPU computing with data transfer between host and GPU device memory
            • ……
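
The benefit of the software pipelining item can be illustrated with a simple timing model. The sketch below is not the framework's implementation. It assumes three stages per patch (host-to-device copy, GPU compute, device-to-host copy), that each stage is a separate resource able to run concurrently with the others, and that every patch has the same stage times. The function names and the example timings are hypothetical.

```python
def serial_time(num_patches, stage_times):
    """Total time when each patch finishes all stages before the next starts."""
    return num_patches * sum(stage_times)


def pipelined_time(num_patches, stage_times):
    """Total time when stages of different patches overlap.

    finish[s] holds the time at which stage s finished its latest patch.
    A stage can start a patch only after it has finished the previous
    patch and the preceding stage has delivered the current one.
    """
    finish = [0.0] * len(stage_times)
    for _ in range(num_patches):
        ready = 0.0  # time at which the current patch leaves the previous stage
        for s, duration in enumerate(stage_times):
            start = max(finish[s], ready)
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


if __name__ == "__main__":
    # Hypothetical timings in milliseconds: copy in, compute, copy out.
    stages = (1.0, 2.0, 1.0)
    print("Serial:   ", serial_time(4, stages))
    print("Pipelined:", pipelined_time(4, stages))
```

In this model the pipelined time approaches the patch count times the slowest stage, so transfers are hidden whenever GPU compute is the bottleneck.
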
  28. Compiler system
      • Heterogeneous programming framework
        • An example: 3-D short-range molecular simulations
          • For each time step
            • Split workload (force calculation) between CPU and GPU
            • For each patch allocated to GPU
              • Start asynchronous operations: transfer the patch data to GPU, compute the patch, get results from GPU
            • For each patch allocated to CPU
              • Launch threads on CPU cores to compute the patch
            • CPU waits for GPU completion event
            • Adjust the split value according to the CPU/GPU performance (patches per second + empirical)
            • Other workload (velocity, position) computed on CPU
          • Performance: one NVIDIA M2050 GPU is 3 times faster than one Intel X5670 CPU
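
The split-and-adjust loop in this example can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the TH-1A framework's code: device work is modeled by fixed throughput rates instead of real GPU and thread launches, all function names are hypothetical, and the `damping` factor is a stand-in for the unspecified empirical term on the slide.

```python
def split_patches(num_patches, gpu_fraction):
    """Return (gpu_count, cpu_count) for the current split value."""
    gpu_count = min(max(round(num_patches * gpu_fraction), 0), num_patches)
    return gpu_count, num_patches - gpu_count


def adjust_fraction(gpu_fraction, gpu_rate, cpu_rate, damping=0.5):
    """Move the split toward the ratio of measured patches-per-second rates.

    damping is a hypothetical smoothing factor, not a value from the slides.
    """
    target = gpu_rate / (gpu_rate + cpu_rate)
    return gpu_fraction + damping * (target - gpu_fraction)


def simulate(steps, num_patches, true_gpu_rate, true_cpu_rate, gpu_fraction=0.5):
    """Run the time-step loop with modeled device speeds; return final split."""
    gpu_rate, cpu_rate = 1.0, 1.0  # initial guesses before any measurement
    for _ in range(steps):
        gpu_n, cpu_n = split_patches(num_patches, gpu_fraction)
        # Both sides run concurrently; the CPU waits for the GPU at the end.
        gpu_time = gpu_n / true_gpu_rate
        cpu_time = cpu_n / true_cpu_rate
        # Measure throughput in patches per second where work was done.
        if gpu_n:
            gpu_rate = gpu_n / gpu_time
        if cpu_n:
            cpu_rate = cpu_n / cpu_time
        gpu_fraction = adjust_fraction(gpu_fraction, gpu_rate, cpu_rate)
    return gpu_fraction


if __name__ == "__main__":
    # One GPU at 3x the speed of one CPU, two CPUs per node.
    print(round(simulate(20, 1000, true_gpu_rate=3.0, true_cpu_rate=2.0), 3))
```

With the quoted 3:1 GPU-to-CPU speed ratio and two CPUs per node, the split settles at about 60% of patches on the GPU, the point where both sides finish a time step together.
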
  29. Programming environment
      • Virtual running environments
        • Provide services on demand
      • Parallel toolkits
        • Based on Eclipse
        • Integrate all kinds of tools
        • Editor, debugger, profiler
      • Workflow support
        • Supports QoS negotiation
        • Reserves resources for future requirements
  30. Visualization system
      • Application area
        • Numerical weather forecast
        • Computational fluid dynamics
        • Oil exploration
        • Other large-scale data
      • Computing platform
        • Tianhe-1A
      • Render server
        • 128 CPUs + 64 GPUs
      • Display device
        • 3x6 multi-channel display wall
  31. Applications
      • Oil exploration
      • High-end equipment development
      • Bio-medical research
      • Animation design
      • New energy research
      • New material research
      • Weather and climate forecasting
      • Engineering design, simulation and analysis
      • Remote sensing data processing
      • Financial risk analysis
  32. Thanks
