Gfarm Fs Tatebe Tip2004

  1. Joint Techs Workshop, TIP 2004, Jan 28, 2004, Honolulu, Hawaii
     Trans-Pacific Grid Datafarm
     Osamu Tatebe, Grid Technology Research Center, AIST (National Institute of Advanced Industrial Science and Technology)
     On behalf of the Grid Datafarm Project
  2. Key points of this talk
     - Trans-Pacific Grid file system and testbed: 70 TBytes disk capacity, 13 GB/sec disk I/O performance
     - Trans-Pacific file replication [SC2003 Bandwidth Challenge]: 1.5 TB of data transferred in an hour
     - Multiple high-speed Trans-Pacific networks: APAN/TransPAC (2.4 Gbps OC-48 POS, 500 Mbps OC-12 ATM), SuperSINET (2.4 Gbps x 2, 1 Gbps available), spanning 6,000 miles
     - Stable 3.79 Gbps out of a theoretical peak of 3.9 Gbps (97%) using 11 node pairs (MTU 6000 B)
     - We won the "Distributed Infrastructure" award!
  3. [Background] Petascale Data-Intensive Computing
     - High Energy Physics: CERN LHC, KEK Belle; ~MB/collision at 100 collisions/sec, ~PB/year; 2,000 physicists in 35 countries (detectors for the LHCb and ALICE experiments)
     - Astronomical Data Analysis: analysis of the whole archive, TB~PB/year/telescope; SUBARU telescope: 10 GB/night, 3 TB/year
  4. [Background 2] Large-scale File Sharing
     - P2P – exclusive and special-purpose approach: Napster, Gnutella, Freenet, . . .
     - Grid technology – file transfer, metadata management: GridFTP, Replica Location Service, Storage Resource Broker (SRB)
     - Large-scale file system – general approach: Legion, Avaki [Grid, no replica management]; Grid Datafarm [Grid]; Farsite, OceanStore [P2P]; AFS, DFS, . . .
  5. Goal and features of Grid Datafarm
     - Goal: dependable data sharing among multiple organizations; high-speed data access and high-speed data processing
     - Grid Datafarm: a Grid File System – a global, dependable virtual file system that integrates CPU + storage for parallel and distributed data processing
     - Features:
       - Secured, based on the Grid Security Infrastructure
       - Scalable with data size and usage scenarios
       - Location-transparent data access
       - Automatic and transparent replica access for fault tolerance
       - High-performance data access and processing by accessing multiple dispersed storages in parallel (file-affinity scheduling)
  6. Grid Datafarm (1): Gfarm file system – a world-wide virtual file system [CCGrid 2002]
     - Transparent access to dispersed file data in a Grid
     - POSIX I/O APIs, plus native Gfarm APIs for extended file-view semantics and replication
     - Maps the virtual directory tree to physical files (a small sketch follows this slide)
     - Automatic and transparent replica access for fault tolerance and avoidance of access concentration
     [Figure: virtual directory tree under /grid, with file-system metadata mapping virtual files to physical files and their replicas in the Gfarm file system]
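     A minimal sketch of the virtual-to-physical mapping idea above, assuming a toy
     in-memory metadata table (the real Gfarm metadata server and its APIs are not
     shown): each virtual path maps to a list of physical replica locations, and a
     read transparently falls back to an available replica.

        # Toy illustration of a Gfarm-style virtual directory tree: a virtual path
        # maps to several physical replicas, and access transparently falls back to
        # another replica when a file system node is unavailable.
        # (Hypothetical paths and nodes; not the actual Gfarm metadata schema or API.)

        metadata = {
            "/grid/ggf/file1": ["node03:/data/a0f1", "node17:/data/b9c2"],
            "/grid/aist/gtrc/file2": ["node05:/data/c3d4"],
        }

        available_nodes = {"node17", "node05"}   # pretend node03 is down

        def open_virtual(path):
            """Return the physical location of an available replica for a virtual path."""
            for replica in metadata[path]:
                node = replica.split(":", 1)[0]
                if node in available_nodes:
                    return replica
            raise IOError("no available replica for " + path)

        print(open_virtual("/grid/ggf/file1"))   # falls back to node17's replica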
  7. Grid Datafarm (2): High-performance data access and processing support [CCGrid 2002]
     - World-wide parallel and distributed processing
     - An aggregate of files = a superfile; data processing of a superfile = parallel and distributed processing of its member files
     - Local file view (SPMD parallel file access)
     - File-affinity scheduling ("owner computes"); see the sketch below
     [Figure: a world-wide virtual CPU over the Grid File System processing a year of astronomical archival data as a superfile, 365 analyses in parallel]
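     A hedged sketch of file-affinity ("owner computes") scheduling: each member file
     of a superfile is processed on the node that already stores it, so computation
     moves to the data. The file names and node assignments below are hypothetical;
     the real Gfarm scheduler is not shown.

        from collections import defaultdict

        # Hypothetical superfile: 365 daily archive fragments, each stored on some node.
        # In Gfarm the storage location comes from the metadata server; here it is faked.
        superfile = [(f"day{d:03d}.fits", f"node{d % 8:02d}") for d in range(365)]

        def schedule_owner_computes(members):
            """Group member files by the node that stores them (file-affinity scheduling)."""
            plan = defaultdict(list)
            for fname, owner in members:
                plan[owner].append(fname)   # the owner node processes its own files
            return plan

        for node, files in sorted(schedule_owner_computes(superfile).items()):
            print(node, len(files), "files")   # each node analyses its ~45 local files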
  8. Transfer technology in long fat networks
     - Bandwidth and latency between the US and Japan: 1-10 Gbps, 150-300 msec RTT (a worked window-sizing example follows this slide)
     - TCP acceleration: adjustment of the congestion window, multiple TCP connections, HighSpeed TCP, Scalable TCP, FAST TCP, XCP (not TCP)
     - UDP-based acceleration: Tsunami, UDT, RBUDP, atou, . . .
     - Bandwidth prediction without packet loss
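     To make the "long fat network" problem concrete, here is a small bandwidth-delay
     product calculation using the figures quoted above (1-10 Gbps, 150-300 msec RTT);
     the specific combinations below are illustrative, not testbed measurements.

        # Bandwidth-delay product: how much data must be "in flight" (and hence how
        # large the TCP window must be) to keep a long fat network full.

        def bdp_bytes(bandwidth_bps, rtt_sec):
            """Return the bandwidth-delay product in bytes."""
            return bandwidth_bps * rtt_sec / 8

        for gbps, rtt_ms in [(1, 150), (2.4, 141), (10, 300)]:
            bdp = bdp_bytes(gbps * 1e9, rtt_ms / 1000)
            print(f"{gbps:>4} Gbps, {rtt_ms} ms RTT -> {bdp / 2**20:.1f} MiB in flight")

        # With a default 64 KiB window, a single stream over a 150 ms path is capped
        # at 64 KiB / 0.15 s = ~3.5 Mbps, which is why window tuning, multiple
        # streams, or modified congestion control is needed.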
  9. Multiple TCP streams sometimes considered harmful . . .
     - Multiple TCP streams achieve good bandwidth, but excessively congest the network; in effect they "shoot themselves in the foot"
     [Figure: bandwidth vs. time (10-msec averages) for three streams over APAN/TransPAC LA-Tokyo (2.4 Gbps); the aggregate oscillates heavily and is not stable, the individual streams compensate for each other, and there is too much network flow]
     - Need to limit bandwidth appropriately
  10. A programmable network testbed device: GNET-1
      - Programmable hardware network testbed with large, high-speed memory blocks
      - WAN emulation: latency, bandwidth, packet loss, jitter, . . .
      - Precise measurement: bandwidth at 100-usec resolution; latency and jitter between two GNET-1 units
      - General purpose, very flexible!
  11. IFG-based pace control by GNET-1
      [Figure: three panels of bandwidth vs. time showing shaping by GNET-1 (700 Mbps x 3 over APAN LA-Tokyo, 2.4 Gbps); the transmit and receive flows hold steady at 700 Mbps]
      [Diagram: 1 Gbps hosts paced by GNET-1 to 700 Mbps ahead of the bottleneck, flow control enabled – NO PACKET LOSS!]
      - GNET-1 provides precise traffic pacing at any data rate by changing the IFG (Inter-Frame Gap); a small IFG arithmetic sketch follows this slide
      - Packet-loss-free network using a large input buffer (16 MB)
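     A hedged sketch of the inter-frame gap (IFG) arithmetic behind this kind of
     pacing: stretching the gap between Ethernet frames lowers the effective rate
     without dropping packets. The accounting below (frame + 8-byte preamble/SFD +
     gap on a fixed line rate) is standard Ethernet arithmetic, not GNET-1's actual
     configuration interface.

        # Pacing by inter-frame gap: the share of wire time occupied by the frame
        # itself determines the effective throughput,
        #   rate = line_rate * frame / (frame + preamble + ifg)
        # so the gap needed for a target rate is
        #   ifg  = frame * (line_rate / target - 1) - preamble

        PREAMBLE = 8   # preamble + start-of-frame delimiter, in bytes

        def ifg_for_rate(frame_bytes, line_rate_bps, target_bps):
            """Inter-frame gap (bytes) that paces frames of this size to the target rate."""
            return frame_bytes * (line_rate_bps / target_bps - 1) - PREAMBLE

        def rate_with_ifg(frame_bytes, line_rate_bps, ifg_bytes):
            """Effective frame-level rate for a fixed inter-frame gap."""
            return line_rate_bps * frame_bytes / (frame_bytes + PREAMBLE + ifg_bytes)

        frame = 1518    # maximum standard Ethernet frame, headers and FCS included
        gap = ifg_for_rate(frame, 1e9, 700e6)
        print(f"gap = {gap:.0f} bytes -> {rate_with_ifg(frame, 1e9, gap) / 1e6:.0f} Mbps")
        # A ~643-byte gap paces a 1 Gbps link to 700 Mbps; the normal minimum gap is 12 bytes.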
  12. Summary of technologies for performance improvement
      [Disk I/O performance] Grid Datafarm – a Grid file system with high-performance data-intensive computing support
      - A world-wide virtual file system that federates the local file systems of multiple clusters
      - Provides scalable disk I/O performance for file replication over high-speed network links and for large-scale data-intensive applications
      - Trans-Pacific Grid Datafarm testbed: 5 clusters in Japan, 3 clusters in the US, and 1 cluster in Thailand provide 70 TBytes of disk capacity and 13 GB/sec disk I/O performance
      - Supports file replication for fault tolerance and avoidance of access concentration
      [Efficient utilization of world-wide high-speed networks] GNET-1 – a gigabit network testbed device
      - Provides IFG-based, precisely rate-controlled flows at any rate
      - Enables stable and efficient Trans-Pacific network use with HighSpeed TCP
  13. Trans-Pacific Grid Datafarm testbed: network and cluster configuration
      [Network/cluster diagram. Totals: Gfarm disk capacity 70 TBytes, disk read/write 13 GB/sec, Trans-Pacific theoretical peak 3.9 Gbps. Clusters: Titech (147 nodes, 16 TBytes, 4 GB/sec), AIST (32 nodes in two 16-node clusters on Tsukuba WAN, 11.7 TBytes and 1 GB/sec each, 23.3 TBytes and 2 GB/sec in total), KEK (7 nodes, 3.7 TBytes, 200 MB/sec), a 10-node cluster (1 TBytes, 300 MB/sec), plus clusters at Univ Tsukuba, SDSC, Indiana Univ, Kasetsart Univ (Thailand), and the SC2003 booth in Phoenix. Networks: SuperSINET (10 Gbps and 2.4 Gbps links via NII, New York route at 950 Mbps), APAN/TransPAC via Tokyo XP (Los Angeles 2.4 Gbps route at 2.34 Gbps, Chicago OC-12 ATM 622 Mbps at 500 Mbps), Abilene, and Maffin.]
      The 3.9 Gbps theoretical peak corresponds to the sum of the three Trans-Pacific routes: 2.4 Gbps (LA) + 0.5 Gbps (Chicago) + 1 Gbps (New York, SuperSINET).
  14. Scientific data for the Bandwidth Challenge
      - Trans-Pacific file replication of scientific data for transparent, high-performance, and fault-tolerant access
      - Astronomical Object Survey on Grid Datafarm [HPC Challenge participant]: world-wide data analysis on the whole archive; 652 GBytes of data observed by the SUBARU telescope; N. Yamamoto (AIST)
      - Large configuration data from Lattice QCD: three sets of hundreds of gluon field configurations on a 24^3 x 48 4-D space-time lattice (3 sets x 800 x 364.5 MB = 854.3 GB), generated by the CP-PACS parallel computer at the Center for Computational Physics, Univ. of Tsukuba (300 Gflops x years of CPU time) [Univ Tsukuba booth]
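     As an aside, the quoted Lattice QCD total is consistent if one gigabyte is taken
     as 1024 MB; a quick check:

        # Check of the Lattice QCD data volume quoted above (assuming 1 GB = 1024 MB).
        sets, configs, mb_per_config = 3, 800, 364.5
        total_mb = sets * configs * mb_per_config
        print(f"{total_mb:.0f} MB = {total_mb / 1024:.1f} GB")   # 874800 MB = 854.3 GB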
  15. Network bandwidth on the APAN/TransPAC LA route
      [Diagram: PC clusters behind Force10 E600 and Juniper M20 switches/routers in LA and Tokyo, shaped by GNET-1; RTT 141 ms, 3 Gbps and 10 Gbps links feeding the 2.4 Gbps Trans-Pacific link]
      [Graph: with no pacing the rate fluctuates; with pacing at 2.3 Gbps (900 + 900 + 500 Mbps), a stable transfer rate of 2.3 Gbps]
  16. APAN/TransPAC LA route (1)
  17. APAN/TransPAC LA route (2)
  18. APAN/TransPAC LA route (3)
  19. File replication between Japan and the US (network configuration)
      [Diagram: PC clusters in Phoenix and in Tokyo/Tsukuba connected over three routes, each shaped by GNET-1: APAN/TransPAC LA (2.4 Gbps, RTT 141 ms), APAN/TransPAC Chicago (500 Mbps, RTT 250 ms), and SuperSINET NYC (2.4 Gbps with 1 Gbps available, RTT 285 ms), via Abilene and Force10 E600 / Juniper M20 switches and routers]
  20. File replication performance between Japan and the US (total)
  21. APAN/TransPAC Chicago: pacing at 500 Mbps, quite stable
  22. APAN/TransPAC LA (1): after re-pacing from 800 to 780 Mbps, quite stable
  23. APAN/TransPAC LA (2): after the re-pacing of LA (1), quite stable
  24. APAN/TransPAC LA (3): after the re-pacing of LA (1), quite stable
  25. SuperSINET NYC: re-pacing from 930 to 950 Mbps
  26. Summary
      - Efficient use near the peak rate of long fat networks: IFG-based precise pacing within the packet-loss-free bandwidth using GNET-1 -> a packet-loss-free network and stable flows even with HighSpeed TCP
      - Disk I/O performance improvement: parallel disk access using Gfarm
      - Trans-Pacific file replication performance: 3.79 Gbps out of a theoretical peak of 3.9 Gbps (97%) using 11 node pairs (MTU 6000 B); 1.5 TB of data transferred in an hour (see the check below)
      - Linux 2.4 kernel problem during file replication (transfer): network transfer stopped within a few minutes, when the buffer cache was flushed to disk (a Linux kernel bug?); defensive solution: set a very short interval for buffer-cache flushing, which limits the transfer rate to 400 Mbps per node pair
      - Successful Trans-Pacific-scale data analysis . . .
      - Scalability problem of the LDAP server used as the metadata server; further improvement needed
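     A quick consistency check of the headline figures on this slide (illustrative
     arithmetic only, assuming decimal terabytes): 1.5 TB in an hour averages about
     3.3 Gbps, in line with a stable 3.79 Gbps that is 97% of the 3.9 Gbps peak.

        # Sanity check of the bandwidth-challenge numbers.
        bits_transferred = 1.5e12 * 8                      # 1.5 TB in bits
        print(f"hourly average: {bits_transferred / 3600 / 1e9:.2f} Gbps")   # ~3.33 Gbps

        peak_gbps, achieved_gbps = 3.9, 3.79
        print(f"achieved / peak = {achieved_gbps / peak_gbps:.0%}")          # 97%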
  27. Future work
      - Standardization effort with the GGF Grid File System WG: foster (world-wide) storage sharing and integration – dependable data sharing and high-performance data access among several organizations
      - Application areas: high-energy physics experiments, astronomical data analysis, bioinformatics, . . .; dependable data processing in eGovernment and eCommerce; other applications that need dependable file sharing among several organizations
  28. Special thanks to Hirotaka Ogawa, Yuetsu Kodama, Tomohiro Kudoh, Satoshi Sekiguchi (AIST), Satoshi Matsuoka, Kento Aida (Titech), Taisuke Boku, Mitsuhisa Sato (Univ Tsukuba), Youhei Morita (KEK), Yoshinori Kitatsuji (APAN Tokyo XP), Jim Williams, John Hicks (TransPAC/Indiana Univ), Eguchi Hisashi (Maffin), Kazunori Konishi, Jin Tanaka, Yoshitaka Hattori (APAN), Jun Matsukata (NII), Chris Robb (Abilene), the Tsukuba WAN NOC team, the APAN NOC team, the NII SuperSINET NOC team, Force10 Networks, PRAGMA, ApGrid, SDSC, Indiana University, and Kasetsart University.
