TACC Lonestar Cluster Upgrade to 300 Teraflops



  1. TACC Lonestar Cluster Upgrade to 300 Teraflops
     Tommy Minyard
     SC10
     November 16, 2010
  2. TACC Mission & Strategy
     The mission of the Texas Advanced Computing Center is to enable discoveries that advance science and society through the application of advanced computing technologies.
     To accomplish this mission, TACC:
       • Evaluates, acquires & operates advanced computing systems
       • Provides training, consulting, and documentation to users
       • Collaborates with researchers to apply advanced computing techniques
       • Conducts research & development to produce new computational technologies
     Resources & Services | Research & Development
  3. TACC Staff Expertise
     • Operating as an advanced computing center since 1986
     • More than 80 employees at TACC
     • 20 Ph.D.-level research staff
     • Graduate and undergraduate students
     • Currently supports thousands of users on production systems
  4. TACC Resources are Comprehensive and Balanced
     • HPC systems to enable larger simulations and analyses and faster turnaround times
     • Scientific visualization resources to enable large-data analysis and knowledge discovery
     • Data & information systems to store large datasets from simulations, analyses, digital collections, instruments, and sensors
     • Distributed/grid computing servers & software to integrate all resources into computational grids
     • Network equipment for high-bandwidth data movement and transfers between systems
  5. TACC’s Migration Towards HPC Clusters
     • 1986: TACC founded; historically ran large Cray systems
     • 2000: First experimental cluster, 16 AMD workstations
     • 2001: First production clusters, a 64-processor Pentium III Linux cluster and a 20-processor Itanium Linux cluster
     • 2003: First terascale cluster, Lonestar, a 1,028-processor Dell Xeon Linux cluster
     • 2006: Largest US academic cluster deployed, a 5,840-core 64-bit Xeon Linux cluster
  6. Current Dell Production Systems
     • Lonestar: 1,460-node, dual-core InfiniBand HPC production system, 62 Teraflops
     • Longhorn: 256-node, quad-core Nehalem visualization and GPGPU computing cluster
     • Colt: 10-node high-end visualization system with a 3x3 tiled-wall display
     • Stallion: 23-node, large-scale visualization system with a 15x5 tiled-wall display (more than 300M pixels)
     • Discovery: 90-node benchmark system with a variety of processors, InfiniBand DDR & QDR
  7. TACC Lonestar System
     • Dell dual-core 64-bit Xeon Linux cluster
     • 5,840 CPU cores (62.1 Tflops)
     • 10+ TB memory, 100+ TB disk
  8. Galerkin Wave Propagation
     • Lucas Wilcox, Institute for Computational Engineering and Sciences, UT-Austin
     • Seismic wave propagation, PI Omar Ghattas
     • Part of research recently on the cover of Science
     • Finalist for the Gordon Bell Prize at SC10
  9. (image-only slide)
  10. (image-only slide)
  11. Molecular Dynamics
     • David LeBard, Institute for Computational Molecular Science, Temple University
     • Pretty Fast Analysis: a software suite for analyzing large-scale simulations on supercomputers and GPU clusters
     • Presented to the American Chemical Society, August 2010
  12. PFA Example: E(r) Around Lysozyme
     • 4,311x
     • E_OH(r) = r_OH · E(r), calculating P(E_OH)
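The projection on this slide can be sketched with NumPy: the field E(r) at each site is projected onto the O-H bond unit vector r_OH, and the resulting scalars are binned into P(E_OH). The random arrays here are stand-ins for illustration, not PFA's actual data or API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # hypothetical number of sampled sites

# Stand-in field vectors and O-H bond vectors (not real MD output).
E = rng.normal(size=(n, 3))
r_oh = rng.normal(size=(n, 3))
r_oh /= np.linalg.norm(r_oh, axis=1, keepdims=True)   # normalize to unit vectors

# E_OH(r) = r_OH . E(r): per-site dot product of field and bond direction.
e_oh = np.einsum("ij,ij->i", E, r_oh)

# P(E_OH): normalized histogram of the projected field.
hist, edges = np.histogram(e_oh, bins=50, density=True)
```

The per-frame dot products are independent, which is what makes this step embarrassingly parallel across GPU cluster nodes.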
  13. Lonestar Upgrade
     • Current Lonestar already has 4+ years of operation
     • Replacement needed to support UT and TeraGrid users along with several other large projects
     • Submitted proposal to NSF with matching UT funds, along with funds from UT-ICES, Texas A&M, and Texas Tech
  14. New Lonestar Summary
     • Compute power: 301.7 Teraflops
       - 1,888 Dell M610 two-socket blades
       - Intel X5680 3.33 GHz six-core “Westmere” processors
       - 22,656 total processing cores
     • Memory: 44 Terabytes
       - 2 GB/core, 24 GB/node
       - 132 TB/s aggregate memory bandwidth
     • Disk subsystem: 1.2 Petabytes
       - Two DDN SFA10000 controllers, 300 2-TB drives each
       - ~20 GB/s total aggregate I/O bandwidth
       - 2 MDS, 16 OSS nodes
     • Interconnect: InfiniBand QDR
       - Four Mellanox 648-port InfiniBand switches
       - Full non-blocking fabric
       - Mellanox ConnectX-2 InfiniBand cards
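The quoted peak follows from the node count and clock. A quick check, assuming 4 double-precision flops per core per cycle for Westmere (128-bit SSE, one add plus one multiply per cycle; this figure is not stated on the slide):

```python
nodes = 1888
sockets_per_node = 2
cores_per_socket = 6
clock_ghz = 3.33
flops_per_cycle = 4      # assumption: Westmere DP peak per core per cycle

cores = nodes * sockets_per_node * cores_per_socket          # 22,656 cores
peak_tflops = cores * clock_ghz * flops_per_cycle / 1000.0   # ~301.8 Tflops

print(cores, round(peak_tflops, 1))
```

This lands within rounding of the 301.7 Teraflops quoted above, which suggests the slide's figure is the theoretical double-precision peak rather than a measured (e.g. HPL) number.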
  15. System Design Challenges
     • Limited by power and cooling
     • X5680 processor: 130 Watts per socket!
     • Fully populated M1000e chassis: ~7 kW of power
     • Three M1000e chassis per rack, so 21 kW per rack
     • Six 208V, 30-amp circuits per rack
     • Forty compute racks total, plus four switch racks
     • Planning a mix of underfloor and overhead cabling
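The rack figures above imply the following budget check. The 80% continuous-load derating and the unity power factor (treating kVA as kW) are assumptions of this sketch, not stated on the slide:

```python
chassis_per_rack = 3
kw_per_chassis = 7.0                               # ~7 kW per populated M1000e
rack_load_kw = chassis_per_rack * kw_per_chassis   # 21 kW per rack, as stated

circuits, volts, amps = 6, 208, 30
raw_kva = circuits * volts * amps / 1000.0         # 37.44 kVA of circuit capacity
# Assumption: 80% continuous-load derating per branch circuit (NEC practice).
usable_kva = raw_kva * 0.8                         # ~29.95 kVA usable

headroom = usable_kva - rack_load_kw               # ~9 kVA of margin per rack
print(rack_load_kw, round(usable_kva, 2), round(headroom, 2))
```

Under these assumptions the 21 kW load fits the six circuits with margin for fans, power-supply inefficiency, and peak draw above the ~7 kW nominal figure.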
  16. (image-only slide)
  17. (image-only slide)
  18. Software Stack
     • Reevaluating current cluster management kits and resource managers/schedulers:
       - Platform PCM/LSF
       - Univa UD
       - Bright Cluster Manager
       - SLURM, MOAB, PBS, Torque
     • Current plan:
       - TACC custom cluster install and administration scripts
       - SGE 6.2u5
       - Lustre 1.8.4
       - Intel compilers
       - MPI libraries: MVAPICH, MVAPICH2, OpenMPI
  19. Questions?