Corralling Big Data at TACC



In this presentation from the DDN User Meeting at SC13, Tommy Minyard from the Texas Advanced Computing Center describes TACC's Corral and Stockyard data storage systems.



  1. Corralling Big Data at TACC
     Tommy Minyard, Texas Advanced Computing Center
     DDN User Group Meeting, November 18, 2013
  2. TACC Mission & Strategy
     The mission of the Texas Advanced Computing Center is to enable scientific discovery and enhance society through the application of advanced computing technologies. To accomplish this mission, TACC:
     – Evaluates, acquires & operates advanced computing systems
     – Provides training, consulting, and documentation to users
     – Collaborates with researchers to apply advanced computing techniques
     – Conducts research & development to produce new computational technologies
  3. TACC Storage Needs
     • Cluster-specific storage
       – High performance (tens to hundreds of GB/s of bandwidth)
       – Large capacity (~2TB per teraflop), purged frequently
       – Very scalable, to thousands of clients
     • Center-wide persistent storage
       – Global filesystem available on all systems
       – Very large capacity, quota-enabled
       – Moderate performance, very reliable, highly available
     • Permanent archival storage
       – Maximum capacity, tens of PB
       – Slower performance; tape-based offline storage with a spinning-disk cache
  4. History of DDN at TACC
     • 2006 – Lonestar 3 with DDN S2A9500 controllers and 120TB of disk
     • 2008 – Corral with a DDN S2A9900 controller and 1.2PB of disk
     • 2010 – Lonestar 4 with DDN SFA10000 controllers and 1.8PB of disk
     • 2011 – Corral upgrade with DDN SFA10000 controllers and 5PB of disk
  5. Global Filesystem Requirements
     • User requests for persistent storage available on all production systems
       – Corral is limited to UT System users only
     • RFP issued for a storage system capable of:
       – At least 20PB of usable storage
       – At least 100GB/s aggregate bandwidth
       – High availability and reliability
     • DDN solution selected for the project
  6. Stockyard: Design and Setup
  7. Stockyard: Design and Setup
     • A Lustre 2.4.1-based global filesystem, with scalability for future upgrades
     • Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12K couplets, corresponding to 5PB of capacity and 25+ GB/s of throughput per SU
     • Four SUs provide 20PB and 100GB/s today
     • Initial set of 16 LNET routers for external mounts
  8. SU (one server rack with two DDN SFA12K couplet racks)
  9. SU Hardware Details
     • SFA12K rack: 50U rack with 8x L6-30p power
     • SFA12K couplet with 16 FDR InfiniBand ports (directly attached to the 16 OSS servers)
     • 84-slot SS8460 drive enclosures (10 per rack, 20 enclosures per SU)
     • 4TB 7200RPM NL-SAS drives
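As a sanity check, the enclosure and drive counts above are consistent with the 5PB-per-SU figure, assuming the 168 RAID6 OSTs are 8+2 arrays (the 8+2 geometry is an inference, not stated in the deck):

```shell
# 20 enclosures/SU x 84 slots x 4TB NL-SAS drives (figures from the slide).
drives=$((20 * 84))            # 1680 drives per SU
raw_tb=$((drives * 4))         # 6720 TB raw
# 168 RAID6 OSTs; assuming 8+2 geometry, each OST exposes 8 data drives:
usable_tb=$((168 * 8 * 4))     # 5376 TB, i.e. the quoted ~5PB per SU
echo "${drives} drives, ${raw_tb} TB raw, ${usable_tb} TB usable"
```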
  10. Stockyard Logical Layout
  11. Stockyard: Capabilities and Features
     • 20PB usable capacity with 100+ GB/s aggregate bandwidth
     • Client systems can bring their own LNET router set to connect to the Stockyard core InfiniBand switches, or connect to the built-in LNET routers over either InfiniBand (FDR14) or TCP (10GigE)
     • Potential HSM integration with the Ranch tape archive system
  12. Capabilities and Features (cont'd)
     • Metadata performance enhancement possible with DNE (phase 1)
     • NRS (Network Request Scheduler) evaluation: characteristics of the different ost_io.nrs_policies settings, particularly crrn (client round-robin over NIDs), under contention dominated by a few jobs
  13. Stockyard: Numbers So Far
     • 16 LNET routers configured as direct clients (within the Stockyard fabric) can push 25GB/s to one SU
     • With two SUs the same set of clients achieves 50GB/s, and 75GB/s with three SUs
     • With four SUs we hit the 16-client limit: no improvement beyond 75GB/s (corresponding to ~4.7GB/s from each client)
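The per-client figure quoted above follows directly from the aggregate limit divided across the 16 direct clients:

```shell
# 75GB/s aggregate across 16 direct clients gives the per-client ceiling.
awk 'BEGIN { printf "%.2f GB/s per client\n", 75 / 16 }'
```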
  14. Numbers So Far (Single Client)
     • Single-thread write performance with Lustre 2.4.1 is ~770MB/s – a big improvement over 2.1.x at about 500MB/s
     • Multi-threaded writes from a single client saturate around 4.7GB/s (with credits=256 on both servers and clients)
  15. Numbers So Far (Aggregate)
     • Performance numbers with 16 LNET routers: 75GB/s from 16 direct clients
     • Numbers from Stampede compute clients: 65GB/s with 256 clients (IOR, POSIX, file-per-process, with 8 tasks per node)
     • Saturation point for Stampede clients: 65GB/s
     • N.B. credits=64 on Stampede client nodes – a quick test on an interactive 2.1.x node with a higher credit count gives the expected boost
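The credits setting mentioned above is a Lustre networking module parameter set before the modules load; a configuration sketch for a client follows. The module name (ko2iblnd, the InfiniBand LND) and the peer_credits value are assumptions, since the deck only quotes the credits number:

```shell
# /etc/modprobe.d/lustre.conf on a client (illustrative sketch, not from the deck)
# Raise the LND credits; credits=64 was the Stampede client default noted above.
options ko2iblnd credits=256 peer_credits=16
```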
  16. Numbers So Far (Failover Tests)
     • OSS failover test setup and results
     • Procedure:
       – Identify the OSTs for the test pair
       – Initiate dd processes targeted at those particular OSTs, each about 67GB in size so that they do not finish before the failover
       – Interrupt one of the OSS servers with a shutdown via ipmitool
       – Record the individual dd process outputs as well as server- and client-side Lustre messages
       – Compare and confirm the recovery and operation of the failover pair with 21 OSTs
     • All I/O completes within 2 minutes of failover
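On a client, the procedure above might look roughly like the following sketch; the mount point, OST indices, and BMC hostname/credentials are placeholders, not values from the deck:

```shell
#!/bin/sh
# Sketch of the OSS failover test described above (placeholder paths/hosts).
run_failover_test() {
    mnt=$1    # hypothetical Lustre client mount point
    for ost in 0 1; do
        # Pin one file per OST under test ("lfs setstripe -i <ost_index>"),
        # then start a ~67GB write so I/O is still in flight at failover time.
        lfs setstripe -c 1 -i "$ost" "$mnt/ost${ost}.dat"
        dd if=/dev/zero of="$mnt/ost${ost}.dat" bs=1M count=68000 &
    done
    # Power off the primary OSS through its BMC while the writes run.
    ipmitool -H oss-bmc.example -U admin -P secret chassis power off
    wait    # the dd jobs should finish once the partner OSS takes over the OSTs
}

# Only meaningful on a host with the Lustre client mounted:
if [ -d /stockyard/failover-test ]; then
    run_failover_test /stockyard/failover-test
    msg="failover test started"
else
    msg="no Lustre mount; skipping"
fi
echo "$msg"
```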
  17. Failover Testing (cont'd)
     • Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:
       Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@ lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
       Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
       Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
       Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at
  18. Automated Failover
     • These tests used an artificial setup to simplify tracking I/O completion on the clients; the shutdown and failover mounts were performed manually
     • Corosync and Pacemaker are being set up to automate the process
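A minimal Pacemaker configuration for one failover OST might look like the sketch below; the resource name, device path, mount point, and node names are all illustrative, not taken from the deck:

```shell
# Illustrative pcs commands (Corosync/Pacemaker already running on the OSS pair).
# Each OST is an ocf:heartbeat:Filesystem resource that either OSS can mount.
pcs resource create stockyard-OST0000 ocf:heartbeat:Filesystem \
    device=/dev/mapper/ost0000 directory=/mnt/ost0000 fstype=lustre \
    op monitor interval=60s
# Prefer the primary OSS, but allow failover to its partner:
pcs constraint location stockyard-OST0000 prefers oss-01=100
pcs constraint location stockyard-OST0000 prefers oss-02=50
```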
  19. Routed Clients
     • We monitor routerstat output on the attached routers and the differences between two timestamps, focusing on an even distribution of the request streams
     • Contrary to the expectation that "autodown" would suffice, Lustre clients need check_routers_before_use=1 for automatic updates of router status
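check_routers_before_use is an lnet module parameter, so on a client it would be set along these lines (the file path is the conventional modprobe location, shown as an assumption):

```shell
# /etc/modprobe.d/lustre.conf on a routed client (sketch)
# Ping routers before use so a dead router is detected at startup
# rather than being assumed alive; auto_down keeps marking routers
# down on failure afterwards.
options lnet check_routers_before_use=1 auto_down=1
```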
  20. Routed Clients (cont'd)
     • Even with automatic router checks, clients cannot detect every non-functional router: a router whose client-side interface is still alive is assumed by clients to be fully functional
     • Clients then encounter timeouts due to the non-functional routers
     • Resolution: add separate health checks on the router nodes themselves
  21. Stockyard: Looking Ahead
     • Deploy as a global $WORK space, which will extend the client count to all TACC resources
     • Evaluate Lustre 2.5.0 before full production, for HSM functionality and compatibility with SAMFS on Ranch
     • Quota management (different in 2.4+)
     • Integrated monitoring setup
     • Security evaluation
  22. Summary
     • Storage capacity and performance needs are growing at an exponential rate
     • High-performance, reliable filesystems are critical for HPC productivity
     • The benefits of large parallel filesystems outweigh the system-administration overhead
     • The current best solution for cost, performance, and scalability is a Lustre-based filesystem