RNP: Brazilian National Education and Research Network
ICT Support for Large-scale Science
5th J-PAS collaboration meeting / 11-Sep-2012
Leandro N. Ciuffo, Alex S. Moura
leandro.firstname.lastname@example.org, email@example.com

Qualified as a non-profit Social Organization (OS), maintained by federal public resources
• Government budget includes items to cover network costs and also RNP operating costs
• Supported by MCTI, MEC and now MinC (Culture)
• RNP is monitored by MCTI, CGU and TCU
• Additional projects are supported by sectorial funds, directly or through the management contract, plus MS (Health)
• "Research Unit" of the Ministry of S&T: http://www.mct.gov.br/index.php/content/view/741.html
[Figure labels: INPE - Instituto Nacional de Pesquisas Espaciais; LNCC - Laboratório Nacional de Computação Científica; INPA - Instituto Nacional de Pesquisas da Amazônia]

Build and operate R&E networks
• Maintenance and continued renewal of infrastructure
• The RNP backbone of 2000 has been renewed 3 times (2004, 2005 and 2011), with large increases in maximum link capacity, from 25 Mbps to 10 Gbps (factor of 400)
• Metro networks have been built in capital cities to provide access to the Point of Presence (PoP) at 1 Gbps or more (www.redecomep.rnp.br)
• International capacity has increased from 300 Mbps in 2002 to over 20 Gbps since 2009 (factor of 70). RNP has also played a major role in building the RedCLARA (Latin American regional) network, linking R&E networks from more than 12 countries (www.redclara.net)
• Testbed networks for network experimentation, especially project GIGA (with CPqD) since 2003 and the EU-BR FIBRE project (2011-2014)

Ipê Network: RNP Backbone
[Backbone map; cities labelled: Boa Vista, Macapá, Fortaleza, Manaus, Salvador, Brasilia, São Paulo, Rio de Janeiro, Florianópolis, Porto Alegre]
• Bandwidth: minimum 20 Mbps, maximum 10 Gbps, aggregated 250 Gbps
• External links: Commodity Internet, RedCLARA (to Europe), Americas Light (to USA)
http://www.rnp.br/backbone/

Why is your data-driven research relevant to RNP?

Science paradigms evolution
• Data-intensive research: unifies theory, experiment and simulation at scale ("big data")
• Computational Simulations: simulating complex phenomena ("in silico")
• Theoretical Modeling: e.g. Kepler's and Newton's laws
• Empirical Science: describing natural phenomena

Key components of a new research infrastructure
[Diagram labels: Scientific portal; Local services and workflows; Big Data Processing; Registering; Discovering; Publishers; Harvesting & Indexing; Users; Publishing; Information Visualization; Data Repositories; Instruments]

Network Requirements Workshop
1. What science is being done?
2. What instruments and facilities are used?
3. What is the process/workflow of science?
4. How are the use of instruments and facilities, the process of science, and other aspects of the science going to change over the next 5 years?
5. What is coming beyond 5 years out?
6. Are new instruments or facilities being built, or are there other significant changes coming?

J-PAS data transfer requirements (to be validated)
[Diagram labels: ~270 TB/y; CSIC; images; T80S; 35 TB/y raw]

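To get a feel for what these yearly volumes mean for the network, the sketch below converts them into an average sustained rate. It assumes decimal terabytes (10^12 bytes) and transfers spread evenly over a 365-day year; neither assumption is stated on the slide, and the function name is only illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def average_rate_mbps(terabytes_per_year: float) -> float:
    """Average rate in Mbps needed to move a yearly volume given in decimal TB."""
    bits_per_year = terabytes_per_year * 1e12 * 8
    return bits_per_year / SECONDS_PER_YEAR / 1e6


if __name__ == "__main__":
    # Yearly volumes taken from the slide (still to be validated).
    for volume_tb in (270, 35):
        rate = average_rate_mbps(volume_tb)
        print(f"{volume_tb} TB/y -> {rate:.1f} Mbps on average")
```

Real transfers are bursty rather than evenly spread, so the peak rate a link has to support will be higher than these averages.
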
Hybrid Networks
• Since the beginning of the Internet, NRENs have provided the routed IP service
• Around 2002, NRENs began to provide two network services:
  • routed IP (traditional Internet)
  • end-to-end virtual circuits (a.k.a. "lightpaths")
• This lightpath service is intended for users with high QoS needs, usually guaranteed bandwidth, which is implemented by segregating their traffic from the general routed IP traffic.
• The GLIF organisation (www.glif.is) coordinates international

High bandwidth research connectivity (lightpaths for supporting international collaboration)
[GLIF world map, 2011] http://www.glif.is

GLIF links in South America
• RNP networks
  • Ipê backbone (29,000 km)
  • metro networks in state capitals
• GIGA optical testbed, from RNP and CPqD
  • links 20 research institutions in 7 cities (750 km)
• KyaTera research network in S. Paulo
  • links research institutions in 11 cities (1500 km)

Why?
• R&E networks in Brazil, and especially RNP, are funded by government agencies to provide quality network services to the national R&E community
• In most cases, this is handled by providing R&E institutions with a connection to our networks, which operate standard Internet services of good quality.
• However, there are times when this is not enough…

Network Requirements and Expectations
• Expected data transfer rates
• As a first step in improving your network performance, it is critical to have a baseline understanding of what speed you should expect from your network connection under ideal conditions.
• The following shows how long it takes to transfer 1 Terabyte of data across networks of various speeds:
  10 Mbps network: 300 h (12.5 days)
  100 Mbps network: 30 h
  1 Gbps network: 3 h
  10 Gbps network: 20 min

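The arithmetic behind such a table can be sketched as follows. This is a minimal illustration, assuming 1 Terabyte = 10^12 bytes; the `efficiency` parameter is a hypothetical knob for protocol overhead and competing traffic, not a measured value. With `efficiency=1.0` the results are theoretical lower bounds, so they come out somewhat shorter than the rounded figures on the slide.

```python
def transfer_time_hours(data_bytes: float, link_bps: float,
                        efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link of link_bps bits per second.

    efficiency is the fraction of the nominal link rate actually achieved
    (an assumed value between 0 and 1, not something the slide specifies).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = data_bytes * 8 / (link_bps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    one_terabyte = 1e12  # decimal terabyte, in bytes
    for label, bps in (("10 Mbps", 10e6), ("100 Mbps", 100e6),
                       ("1 Gbps", 1e9), ("10 Gbps", 10e9)):
        hours = transfer_time_hours(one_terabyte, bps)
        print(f"{label}: {hours:.2f} h at full line rate")
```
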
Inadequate performance for critical applications
• In some cases, the standard Internet services are not good enough for high-performance or data-intensive projects.
• These applications are sensitive to perturbations caused by security devices:
  - Numerous cases of firewalls causing problems
  - Often difficult to diagnose
  - Router filters can often provide equivalent security without the performance impact
• Science and Enterprise network requirements are in conflict

• Tuning of networking software is generally necessary on high-bandwidth, long-latency data connections, because of the peculiarities of TCP implementations
• In the case of high QoS requirements it is often necessary to use lightpaths, to avoid interference with cross traffic
• In many cases, both these approaches are required

The Cipó Experimental Service
• We are now beginning to deploy dynamic circuits as an experimental service on our network
  • This will also interoperate with similar services in other networks.

Getting support
• If you need advice or assistance with these network problems, it is important to get in touch with network support:
  1. At your own institution
  2. At your state network provider: www.rnp.br/pops/index.php
  3. In the case of specific circuit (lightpath) services, you may contact RNP directly at firstname.lastname@example.org

Network Diagnostic Tool (NDT)
• Test the bandwidth from your computer to an RNP PoP:
  • São Paulo: http://ndt.pop-sp.rnp.br
  • Rio de Janeiro: http://ndt.pop-rj.rnp.br
  • Florianópolis: http://ndt.pop-sc.rnp.br

Recommended Approach
• On a high-speed network it takes less time to transfer 1 Terabyte of data than one might expect.
• It is usually sub-optimal to try to get 900 megabits per second of throughput on a 1 gigabit per second network path in order to move one or two terabytes of data per day. The disk subsystem can also be a bottleneck: simple storage systems often have trouble filling a 1 gigabit per second pipe.
• In general it is not a good idea to try to completely saturate the network, as you will likely end up causing problems for both yourself and others trying to use the same link. A good rule of thumb is that for periodic transfers it should be straightforward to get throughput equivalent to 1/4 to 1/3 of a shared path that has nominal background load.
• For example, if you know your receiving host is connected to 1 Gbps Ethernet, then a target speed of 150-200 Mbps is reasonable. You can adjust the number of parallel streams (as described on the tools page) that you are using to achieve this.
• Many labs and large universities are connected at speeds of at least 1 Gbps, and most LANs are at least 100 Mbps, so if you don't get at least …

Performance using TCP
• Three important variables (among others) affect TCP performance: packet loss, latency (RTT, Round Trip Time), and buffer size/window. All are interrelated.
• The optimal buffer size is twice the product of the bandwidth and the one-way delay of the link/connection. Since ping reports the round trip time, which is twice the one-way delay, this is the same as:
  • buffer size = bandwidth x RTT
• e.g.: if the result of ping is 50 ms and the end-to-end network is all 1G or 10G Ethernet, the TCP receive buffers (an operating system parameter) for a 1 Gbps path should be:
  • 0.05 s x (1 Gbit/s / 8 bits per byte) = 6.25 MBytes

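The buffer-size rule above is easy to reproduce for other paths. The following is a small sketch of the calculation, with an illustrative function name; it assumes the RTT is taken from ping and that the bandwidth is that of the slowest link on the path.

```python
def tcp_buffer_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes: bandwidth x RTT, converted from bits."""
    return bandwidth_bps * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.050  # 50 ms, as in the ping example above
    for label, bps in (("1 Gbps", 1e9), ("10 Gbps", 10e9)):
        size_mb = tcp_buffer_bytes(bps, rtt) / 1e6
        print(f"{label} path, RTT 50 ms: {size_mb:.2f} MBytes")
```

For the 1 Gbps case this gives the 6.25 MBytes quoted above; a 10 Gbps path with the same RTT needs ten times as much.
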
TCP Congestion Avoidance Algorithms
• The TCP reno congestion avoidance algorithm was the default in all TCP implementations for many years. However, as networks got faster and faster, it became clear that reno would not work well for networks with a high bandwidth-delay product. To address this, a number of new congestion avoidance algorithms were developed. The available algorithms include:
  • reno: Traditional TCP used by almost all other operating systems (the default on older Linux versions)
  • cubic: CUBIC-TCP
  • bic: BIC-TCP
  • htcp: Hamilton TCP
  • vegas: TCP Vegas
  • westwood: optimized for lossy networks
• Most Linux distributions now use cubic by default, and Windows now uses compound tcp. If you are using an older version of Linux, be sure to change the default from reno to cubic or htcp.
• More details can be found at: http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm

MTU Issues
• Jumbo Ethernet frames can increase performance by a factor of 2-4.
• The ping tool can be used to verify the MTU size.
• For example, on Linux you can do:
    ping -s 8972 -M do -c 4 10.200.200.12
• Other tools that can help verify the MTU size are scamper and tracepath

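The payload size of 8972 in the ping example is the 9000-byte jumbo MTU minus the 20-byte IPv4 header (without options) and the 8-byte ICMP echo header. The sketch below derives that value and assembles the same command for other MTUs; the function names are illustrative, and the code only builds the command line without running it.

```python
from typing import List

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP payload that fits in a single unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def mtu_probe_command(host: str, mtu: int = 9000, count: int = 4) -> List[str]:
    """Build the Linux ping command from the slide for a given MTU.

    -M do forbids fragmentation, so the probe fails if any hop has a smaller MTU.
    """
    payload = ping_payload_for_mtu(mtu)
    return ["ping", "-s", str(payload), "-M", "do", "-c", str(count), host]


if __name__ == "__main__":
    # 10.200.200.12 is the example address used on the slide.
    print(" ".join(mtu_probe_command("10.200.200.12")))
    print(" ".join(mtu_probe_command("10.200.200.12", mtu=1500)))
```
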
Say No to scp: Why you should avoid scp over a WAN
• In a Unix environment scp, sftp, and rsync are commonly used to copy data between hosts.
• While these tools work fine in a local environment, they perform poorly on a WAN.
• The openssh versions of scp and sftp have a built-in 1 MB buffer (previously only 64 KB in openssh versions older than 4.7) that severely limits performance on a WAN.
• rsync is not part of the openssh distribution, but typically uses ssh as transport (and is subject to the limitations imposed by the underlying ssh implementation).
• DO NOT USE THESE TOOLS if you need to transfer large data sets across a network path with an RTT of more than around 25 ms.

Why you should avoid scp over a WAN (cont.)
• The following results are typical: scp is 10x slower than single stream GridFTP, and 50x slower than parallel GridFTP.
• Sample results: Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10 Gbps.

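One way to see why a fixed buffer hurts on long paths: a sender that can keep at most one buffer of data in flight per round trip cannot exceed buffer size divided by RTT. The sketch below applies that bound to the openssh buffer sizes mentioned earlier and the 53 ms RTT of the sample path. It assumes 1 MB means 2^20 bytes and 64 KB means 2^16 bytes, and it gives only an upper limit, not a prediction of measured scp throughput.

```python
def window_limited_mbps(buffer_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in Mbps when one buffer is sent per round trip."""
    return buffer_bytes * 8 / rtt_seconds / 1e6


if __name__ == "__main__":
    rtt = 0.053  # 53 ms, Berkeley to Argonne sample path
    for label, size in (("1 MB buffer", 2**20), ("64 KB buffer", 2**16)):
        limit = window_limited_mbps(size, rtt)
        print(f"{label}: at most {limit:.1f} Mbps at RTT 53 ms")
```

Under these assumptions even the larger buffer caps a single transfer at a small fraction of the 10 Gbps path capacity.
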
A Simple Science DMZ
• A simple Science DMZ has several essential components. These include dedicated access to high-performance wide area networks and advanced services infrastructures, high-performance network equipment, and dedicated science resources such as Data Transfer Nodes. Here is a diagram of a simple Science DMZ showing these components and data paths:

Science DMZ: Supercomputer Center Network
• The diagram below illustrates a simplified supercomputer center network. While this may not look much like the previous simple Science DMZ diagram, the same principles are used in its design.

Science DMZ: Big Data Site
• For sites that handle very large data volumes (e.g. for big experiments such as the LHC), individual data transfer nodes are not enough.
• Data transfer clusters are needed: groups of machines serve data from multi-petabyte data stores.
• The same principles of the Science DMZ apply: dedicated systems are used for data transfer, and the path to the wide area is clean, simple, and easy to troubleshoot. Test and measurement are integrated in multiple locations to enable fault isolation. This network is similar to the supercomputer center example in that the wide area data path covers the entire network front-end.

Data Transfer Node (DTN)
• Computer systems used for wide area data transfers perform far better if they are purpose-built and dedicated to the function of wide area data transfer. These systems, which we call Data Transfer Nodes (DTNs), are typically PC-based Linux servers built with high-quality components and configured specifically for wide area data transfer.
• ESnet has assembled a reference implementation of a host that can be deployed as a DTN or as a high-speed GridFTP test machine.
• The host can fill a 10 Gbps network connection with disk-to-disk data transfers using GridFTP.
• The total cost of this server was around $10K, or $12.5K with the more expensive RAID controller. If your DTN is used only as a data cache using RAID0 instead of a reliable storage server using RAID5, you can get by with the less expensive RAID controller.
• Key aspects of the configuration include: recent version of the

DTN Hardware Description
• Chassis: AC SuperMicro SM-936A-R1200B 3U 19" Rack Case with Dual 1200W PS
• Motherboard: SuperMicro X8DAH+F version 1.0c
• CPU: 2 x Intel Xeon Nehalem E5530 2.4GHz
• Memory: 6 x 4GB DDR3-1066MHz ECC/REG
• I/O Controller: 2 x 3ware SAS 9750SA-8i (about $600) or 3ware SAS 9750-24i4e (about $1500)
• Disks: 16 x Seagate 500GB SAS HDD 7,200 RPM ST3500620SS
• Network Controller: Myricom 10G-PCIE2-8B2-2S+E
• Linux Distribution
  • Most recent distribution of CentOS Linux
  • Install 3ware driver: http://www.3ware.com/support/download.asp
  • Install ext4 utilities: yum install e4fsprogs.x86_64

DTN Tuning
• Add to /etc/sysctl.conf, then run sysctl -p:
    # standard TCP tuning for 10GE
    net.core.rmem_max = 33554432
    net.core.wmem_max = 33554432
    net.ipv4.tcp_rmem = 4096 87380 33554432
    net.ipv4.tcp_wmem = 4096 65536 33554432
    net.ipv4.tcp_no_metrics_save = 1
    net.core.netdev_max_backlog = 250000
• Add to /etc/rc.local:
    # increase the size of data the kernel will read ahead (this favors sequential reads)
    /sbin/blockdev --setra 262144 /dev/sdb
    /sbin/blockdev --setra 262144 /dev/sdc
    /sbin/blockdev --setra 262144 /dev/sdd
    # increase txqueuelen
    /sbin/ifconfig eth2 txqueuelen 10000
    /sbin/ifconfig eth3 txqueuelen 10000
    # make sure cubic and htcp are loaded
    /sbin/modprobe tcp_htcp
    /sbin/modprobe tcp_cubic
    # set default to htcp
    /sbin/sysctl net.ipv4.tcp_congestion_control=htcp
    # with the Myricom 10G NIC, increasing interrupt coalescing helps a lot:
    /usr/sbin/ethtool -C ethN rx-usecs 75

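When applying settings like these to an existing host, it can help to see which of them differ from what is already configured. The sketch below is one possible way to do that: it parses text in the `key = value` form used by /etc/sysctl.conf and reports keys whose values do not match the recommendations above. The function names and the sample input are hypothetical; only the recommended values come from the slide.

```python
RECOMMENDED = {
    "net.core.rmem_max": "33554432",
    "net.core.wmem_max": "33554432",
    "net.ipv4.tcp_rmem": "4096 87380 33554432",
    "net.ipv4.tcp_wmem": "4096 65536 33554432",
    "net.ipv4.tcp_no_metrics_save": "1",
    "net.core.netdev_max_backlog": "250000",
}


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Collapse tabs and repeated spaces so multi-value settings compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def differences(text, recommended=RECOMMENDED):
    """Map each recommended key to (current, recommended) where they differ."""
    current = parse_sysctl(text)
    return {key: (current.get(key), wanted)
            for key, wanted in recommended.items()
            if current.get(key) != wanted}


if __name__ == "__main__":
    # Hypothetical existing configuration, for illustration only.
    sample = """
    # existing settings
    net.core.rmem_max = 212992
    net.ipv4.tcp_rmem = 4096\t87380\t33554432
    """
    for key, (have, want) in differences(sample).items():
        print(f"{key}: current={have!r} recommended={want!r}")
```

The same functions can be fed the contents of /etc/sysctl.conf read from disk; a value of None for the current setting means the key is not set in that file.
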
DTN Tuning (cont.)
• Tools
  • Install a data transfer tool such as GridFTP (see the GridFTP quick start page). Information on other tools can be found on the tools page.
• Performance results for this configuration
  • Back-to-back testing using GridFTP:
    - memory to memory, 1 10GE NIC: 9.9 Gbps
    - memory to memory, 4 10GE NICs: 38 Gbps
    - disk to disk: 9.6 Gbps (1.2 GBytes/sec) using large files on all 3 disk partitions in parallel

Thank you / Obrigado!
Leandro Ciuffo - email@example.com
Alex Moura - firstname.lastname@example.org
Twitter: @RNP_pd