
Introduction to National Supercomputer Center in Tianjin TH-1A Supercomputer

  1. Introduction to National Supercomputer Center in Tianjin TH-1A Supercomputer
  2. Agenda
     • National Supercomputer Center in Tianjin (NSCC-TJ)
     • TH-1A system
       • Hardware sub-system
       • Software sub-system
     • Applications
  3. NSCC-TJ
     • National Supercomputer Center in Tianjin, sponsored by
       • the Chinese Ministry of Science and Technology
       • the Tianjin Binhai New Area
     • Public information infrastructure
       • To accelerate the economy, education and industry of northern China
       • To provide high-performance computing services to the whole of China
       • An open platform for research and education
  4. NSCC-TJ: main building with offices, computer room, and transformer station & air conditioner; total area 2,400 m²
  5. NSCC-TJ: first floor of the central computing room, 1,200 m²
  6. NSCC-TJ: second floor of the central computing room (visualization environment), 1,200 m²
  7. NSCC-TJ: electric transformer station
  8. NSCC-TJ: cooling water station
  9. NSCC-TJ: layout of the computing room
  10. TH-1A system
  11. TH-1A system
     • Enhanced system based on the TH-1 system (Sep. 2009)
     • Installed at NSCC-TJ, Aug. 2010
     • Debugging and performance testing, Sept.-Oct. 2010
     • In service since Nov. 2010
     • Configuration:
       • Processors: 14,336 Intel CPUs + 7,168 NVIDIA GPUs + 2,048 FT CPUs
       • Memory: 262 TB in total
       • Interconnect: proprietary high-speed interconnection network
       • Storage: 2 PB
       • Cabinets: 120 compute/service cabinets, 14 storage cabinets, 6 communication cabinets
  12. TH-1A system
     • TH-1A system architecture
       • Hybrid MPP structure: CPU & GPU
       • Proprietary compute nodes
       • Connected by a proprietary high-speed interconnection network
       • Globally shared parallel storage system
       • Custom software stack
  13. TH-1A hardware sub-system (block diagram): compute sub-system (CPU + GPU nodes), service sub-system (operation nodes), communication sub-system, storage sub-system (MDS and OSS nodes), and monitor and diagnosis sub-system
  14. Compute sub-system
     • 7,168 compute nodes
       • 2 six-core CPUs and 1 GPU per node
       • CPU: Intel Xeon X5670 (Westmere), 2.93 GHz
       • GPU: NVIDIA Tesla M2050, connected to the CPUs by PCI-E
       • 32 GB memory per node
       • 2U height
     • Peak performance: 4,701,061 Gflops
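     As a sanity check, the quoted peak can be rebuilt from per-device peaks. This is a
     minimal sketch assuming the usual double-precision figures of 70.32 Gflops per
     Xeon X5670 (2.93 GHz × 6 cores × 4 flops/cycle) and 515.2 Gflops per Tesla M2050;
     the FT-1000 service CPUs are not counted here.

       /* Back-of-the-envelope check of the 4,701,061 Gflops peak (assumed per-device
          double-precision peaks; FT-1000 service CPUs excluded). */
       #include <stdio.h>

       int main(void) {
           const int    nodes      = 7168;
           const double cpu_gflops = 2.93 * 6 * 4;   /* 70.32 Gflops per X5670 */
           const double gpu_gflops = 515.2;          /* DP peak of one Tesla M2050 */
           const double peak       = nodes * (2 * cpu_gflops + gpu_gflops);
           printf("Peak: %.0f Gflops\n", peak);      /* ~4,701,061 Gflops */
           return 0;
       }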
  15. Service sub-system
     • 1,024 service nodes
       • 2 eight-core domestic CPUs per node
       • CPU: FT-1000
         • SoC, 1.0 GHz
         • Eight cores, eight threads per core
         • Peak performance: 8 Gflops
       • 32 GB memory per node
     • For login, compilation, and applications that need throughput computing
  16. Proprietary interconnection network
     • Interconnection signal speed: 10 Gbps
     • Bi-directional bandwidth: 160 Gbps
     • Hierarchical fat-tree structure
       • First stage: 16 nodes connected by a 16-port switching board
       • Second stage: all parts connected to eleven 384-port switches
  17. Proprietary interconnection network
     • High-radix router ASIC: NRC
       • Feature size: 90 nm
       • Die size: 17.16 mm × 17.16 mm
       • Package: FC-PBGA, 2,577 pins
       • Throughput of a single NRC: 2.56 Tbps
     • Network interface ASIC: NIC
       • Same feature size and package as the NRC
       • Die size: 10.76 mm × 10.76 mm
       • 675 pins
  18. Proprietary interconnection network
     • 16-port switch board in the cabinet
     • Leaf switch blade and root switch blade of the 384-port switch
     • Backplane of the 384-port switch, about 700 mm × 600 mm
  19. Proprietary interconnection network
     • Switching board and high-radix switch
       • Based on the network interface ASIC and router ASIC
     • Reduced user communication protocol
     • Throughput: 61.44 Tbps
     • (Photos: front and back views of two 384-port high-radix switches)
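     The 61.44 Tbps figure is consistent with the per-port bandwidth quoted on slide 16;
     a minimal arithmetic sketch, assuming all 384 ports of the high-radix switch run at
     the full 160 Gbps bi-directional rate:

       /* Consistency check: 384 ports x 160 Gbps = 61,440 Gbps = 61.44 Tbps. */
       #include <stdio.h>

       int main(void) {
           const int    ports         = 384;
           const double gbps_per_port = 160.0;   /* bi-directional bandwidth per port */
           printf("Switch throughput: %.2f Tbps\n", ports * gbps_per_port / 1000.0);
           return 0;
       }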
  20. Storage sub-system
     • Capacity: 2 PB
     • Connected by the proprietary interconnection network
     • Lustre-based parallel file system
  21. Monitor and diagnosis sub-system
     • Rich monitoring & control functions
       • Real-time monitoring of hardware parameters
       • Precise fault location
       • Alarms and immediate action in emergencies
       • Self-feedback cooling adjustment based on environment status
     • I2C & JTAG diagnosis mechanisms
     • Large-scale console
     • Remote monitoring and management
  22. Computing cabinet
     • Node: 2 CPUs and 1 GPU
     • Blade: 2 nodes
     • Frame
       • 8 computing blades
       • 1 16-port switching board
       • 1 monitor and diagnosis board
     • Cabinet: 4 frames, 64 nodes (128 CPUs, 64 GPUs)
     • Close-coupled chilled-water cooling
       • 56 kW cooling capacity per cabinet
     • Footprint: 700 m²
  23. TH-1A software sub-system
     • Software stack
  24. Operating system
     • Kylin Linux
     • Compute-node kernel
     • Provides virtual running environments
       • Isolated running environments for different users
       • Custom software package installation
     • QoS support
     • Power-aware computing
  25. Compiler system
     • C, C++, Fortran, Java
     • OpenMP, MPI, hybrid OpenMP/MPI
     • CUDA, OpenCL
     • Heterogeneous programming framework
       • Accelerates large-scale, complex applications, especially applications that are still in development or whose full source code is not available
       • Uses the computing power of both CPUs and GPUs, hiding GPU programming from users
       • Inter-node homogeneous parallel programming (users)
       • Intra-node heterogeneous parallel computing (computer experts)
  26. Compiler system
     • Heterogeneous programming framework
       • Inter-node homogeneous parallel programming (JASMIN)
         • Patch-based object data structures
         • MPI communication and dynamic load-balancing support
         • Zero-copy optimization in the communication library
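     To make the inter-node level concrete, here is a minimal, hypothetical MPI skeleton
     in the spirit of patch-based decomposition (this is not JASMIN's actual API; the
     patch layout, message sizes and loop counts are illustrative assumptions):

       /* Each MPI rank owns one patch and exchanges halo data with its neighbours
          every step; intra-node CPU/GPU work would plug in where indicated. */
       #include <mpi.h>

       #define CELLS 1024
       #define HALO  64

       int main(int argc, char **argv) {
           double cells[CELLS] = {0}, halo[HALO] = {0};
           int rank, size;

           MPI_Init(&argc, &argv);
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
           MPI_Comm_size(MPI_COMM_WORLD, &size);

           int left  = (rank - 1 + size) % size;
           int right = (rank + 1) % size;

           for (int step = 0; step < 10; ++step) {
               /* Exchange boundary cells with neighbouring ranks (homogeneous view). */
               MPI_Sendrecv(cells, HALO, MPI_DOUBLE, right, 0,
                            halo,  HALO, MPI_DOUBLE, left,  0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE);
               /* ... intra-node CPU/GPU computation on the local patch goes here ... */
           }

           MPI_Finalize();
           return 0;
       }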
  27. Compiler system
     • Heterogeneous programming framework
       • Intra-node heterogeneous parallel computing
         • Compiler-optimized / hand-tuned threaded code
         • Optimizations include:
           • Adaptive partitioning: balance the workloads between CPUs and GPU
           • Asynchronous data transfer / computing: overlap CPU operations with GPU operations
           • Software pipelining: overlap GPU computing with data transfer between host and GPU device memory
           • ...
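     One of the listed optimizations, overlapping transfers with GPU computing, can be
     sketched in CUDA roughly as follows. This is an illustration only, not the
     framework's actual code; the kernel and buffer names are hypothetical, and the host
     buffers are assumed to be pinned (e.g. allocated with cudaMallocHost) so the copies
     are truly asynchronous.

       #include <cuda_runtime.h>

       /* Placeholder for real per-patch work. */
       __global__ void compute_patch(float *d, int n) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           if (i < n) d[i] *= 2.0f;
       }

       /* Two streams: each patch's H2D copy, kernel and D2H copy are queued
          asynchronously so they overlap with the other stream and with CPU work. */
       void process_two_patches(float *h_buf[2], float *d_buf[2], int n) {
           cudaStream_t s[2];
           for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

           for (int p = 0; p < 2; ++p) {
               cudaMemcpyAsync(d_buf[p], h_buf[p], n * sizeof(float),
                               cudaMemcpyHostToDevice, s[p]);
               compute_patch<<<(n + 255) / 256, 256, 0, s[p]>>>(d_buf[p], n);
               cudaMemcpyAsync(h_buf[p], d_buf[p], n * sizeof(float),
                               cudaMemcpyDeviceToHost, s[p]);
           }

           /* While the streams run, CPU threads can compute their own patches here. */

           for (int k = 0; k < 2; ++k) {
               cudaStreamSynchronize(s[k]);
               cudaStreamDestroy(s[k]);
           }
       }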
  28. Compiler system
     • Heterogeneous programming framework
       • An example: 3-D short-range molecular simulations
         • For each time step:
           • Split the workload (force calculation) between CPU and GPU
           • For each patch allocated to the GPU: start asynchronous operations (transfer the patch data to the GPU, compute the patch, get the results back)
           • For each patch allocated to the CPU: launch threads on CPU cores to compute the patch
           • The CPU waits for the GPU completion event
           • Adjust the split value according to the measured CPU/GPU performance (patches per second + empirical tuning)
           • Other workloads (velocity and position updates) are computed on the CPU
         • Performance: one NVIDIA M2050 GPU is 3 times faster than one Intel X5670 CPU
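     A minimal sketch of the adaptive split loop described above. CPU and GPU throughput
     are modelled here with fixed assumed rates (using the 3x GPU-vs-CPU figure from the
     slide) instead of measured timings, and the proportional update rule is an
     illustrative assumption, not the framework's published formula.

       #include <stdio.h>

       int main(void) {
           const int    npatches = 1024;
           const double cpu_rate = 100.0;   /* patches/second on the CPU (assumed) */
           const double gpu_rate = 300.0;   /* patches/second on the GPU (assumed, 3x) */
           double gpu_share = 0.5;          /* initial guess: half of the patches */

           for (int step = 0; step < 5; ++step) {
               int gpu_patches = (int)(gpu_share * npatches);
               int cpu_patches = npatches - gpu_patches;

               /* In the real framework these would be measured wall-clock times for
                  the asynchronous GPU patches and the threaded CPU patches. */
               double gpu_secs = gpu_patches / gpu_rate;
               double cpu_secs = cpu_patches / cpu_rate;

               /* Rebalance: give each side a share proportional to its observed rate. */
               double r_gpu = gpu_patches / gpu_secs;
               double r_cpu = cpu_patches / cpu_secs;
               gpu_share = r_gpu / (r_gpu + r_cpu);

               printf("step %d: next gpu_share = %.2f\n", step, gpu_share);
           }
           return 0;
       }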
  29. Programming environment
     • Virtual running environments
       • Provide services on demand
     • Parallel toolkit
       • Based on Eclipse
       • Integrates various kinds of tools: editor, debugger, profiler
     • Workflow support
       • QoS negotiation support
       • Resource reservation for future requirements
  30. Visualization system
     • Application areas
       • Numerical weather forecast
       • Computational fluid dynamics
       • Oil exploration
       • Other large-scale data
     • Computing platform: Tianhe-1A
     • Render server: 128 CPUs + 64 GPUs
     • Display device: 3×6 multi-channel display wall
  31. Applications
     • Oil exploration
     • High-end equipment development
     • Bio-medical research
     • Animation design
     • New energy research
     • New material research
     • Weather and climate forecasting
     • Engineering design, simulation and analysis
     • Remote sensing data processing
     • Financial risk analysis
  32. Thanks
