Lessons learned from shifting real data around: An ad hoc data challenge from CERN to Lancaster

1. Lessons learned from shifting real data around: An Ad Hoc Data Challenge from CERN to Lancaster
   Jisc Network Engineer Meeting, 9th December 2022.
   Matt Doidge (Lancaster University), James Walder (STFC), Duncan Rand (Jisc/Imperial College London), with contributions from Gerard Hand and Steven Simpson (Lancaster University).
   Adapted from a talk given at the Autumn HEPiX 2022.
2. Background
   ● We are a long-established "WLCG Tier-2" grid site in the UK, at Lancaster University.
   ● A member of the GridPP collaboration.
   ● Provide ~8000 cores of compute and 10PB of storage to the WLCG.
   ● Our primary users are the ATLAS experiment.
   ● Recently implemented a new storage solution that we wanted to test the performance of.
   ● Another recent change was the increase of the site's connection to Janet from 10 Gb/s to 40 Gb/s (4x10).
   ● The storage (and the grid site as a whole) is outside the institution firewall (essentially a DMZ), and has an almost dedicated network link (it's the University's backup link).
   Disclaimer: The original version of this presentation was given to a group of High Energy Physics sysadmins.
3. Alphabet Soup
   WLCG: Worldwide LHC Computing Grid, providing distributed computing resources to the LHC.
   LHC: The Large Hadron Collider at CERN.
   GridPP: The collaboration that provides the UK contribution to the WLCG, representing the UK sites.
   Tier-X: The WLCG sites are tiered: 0 == CERN, 1 == national centres (RAL for the UK), 2 == distributed computing centres (within the UK these are at institutions).
   ATLAS: One of the main LHC experiments.
   FTS: File Transfer Service.
   XRootD: eXtended Root Daemon, the endpoint we use to present our storage, popular in grid circles. Originated at Stanford about 20 years ago.
4. A quick (to the point of being almost useless) guide to ATLAS data management infrastructure
   ATLAS data is managed through two key components:
   ● Rucio - this is the backbone, describing datasets and data locations, as well as allowing data to be "subscribed" to a site (i.e. requested to be replicated there).
     ○ Describes lists of rules and replications.
     ○ Growing in popularity outside of the WLCG, with several astronomy groups adopting it.
   ● FTS - the service that moves data around. Handles retries and consistency checks, and controls rates.
     ○ Orchestrates transfers through third-party copies between remote storage endpoints.
     ○ Controls the number of simultaneous transfers and the transfer rates, to protect against link or endpoint saturation.
       ■ Will scale transfer rates up and down based on success/failure rates.
         ● Tim pointed out this makes it like "Super TCP".
5. The FTS at work.
6. XRootD endpoint?
   ● XRootD is one of the choices available for "exposing" WLCG (or any) storage to the world.
     ○ Very much developed in the Particle Physics community.
     ○ Whilst it has its own protocol (xroot) it can also provide https/davs endpoints.
     ○ Traditionally used X.509 authentication (as was the WLCG standard), but with support for modern "bearer token" style authentication.
     ○ Happily transfers data over both IPv4 and IPv6.
     ○ Popular as it accepts authenticated third-party https/davs transfers, which are the preferred transfer method within WLCG.
     ○ Simple (in principle) to use to securely expose POSIX-like storage to the outside world (mount the volume on the xroot server, just be sure to set up your authz/authn correctly…) - a minimal config sketch follows this slide.
     ○ Easily scalable, as it can be set up to natively redirect to other xroot servers, with in-built load balancing.
   XRootD is also creeping outside of Particle Physics into other communities.
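   As a minimal illustrative sketch of such an endpoint config: the export path, port, certificate locations and auth hooks below are placeholders rather than our production settings (those are linked in the references).

       all.export /cephfs                          # serve the POSIX-mounted (CEPHFS) volume
       xrd.protocol http:8443 libXrdHttp.so        # add an https/davs endpoint alongside xroot
       http.cert /etc/grid-security/hostcert.pem   # TLS host certificate
       http.key /etc/grid-security/hostkey.pem     # TLS host key
       # authn/authz (X.509/VOMS or bearer tokens) still needs configuring on top,
       # e.g. via http.secxtractor / ofs.authlib - omitted here for brevity.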
7. Lancaster XRootD setup
   ● Lancaster's storage is a large 10PB CEPHFS volume, presented to the world via mounts on our xroot servers.
   ● At the start of the tests it consisted of two servers - a redirector/data server and a data server.
     ○ *spoiler* midway through we added another modest box.
   ● xroot server 1 (redirector and data server): midrange Dell "pizza box", 24 cores (48 threads), 128GB RAM, 2 x 25Gb Broadcom NICs (1 world facing, 1 internal, mounting CEPHFS).
   ● xroot server 2 (data server): another, slightly more modest pizza box (technically a retasked NFS server), same CPU, 96GB of RAM (this became a factor), same networking.
8. Motivation
   We've had two big changes that could affect, in unknown ways, the rate at which we can read (and write) data from (and to) our site: a new storage solution and a network link upgrade. With that in mind:
   How much data can we shovel to our site?
   ● We have perfSONAR, but this only gives us a limited picture of our data transfer capacity.
   ● Manual tests tend to only show functionality, not capacity.
   ● ATLAS have the infrastructure to push around a lot of data.
   Where's the inevitable bottleneck?
   ● The site network link (40Gb/s)?
   ● The XRootD servers' network? (This would be N x 25 Gb, where N is the number of servers - rough numbers after this slide.)
   ● CEPHFS write speed? (dd tests were NIC-limited, suggesting this won't be the case.)
   ● The XRootD servers' capacity to "process" transfers?
   Are we monitoring everything we should monitor?
   ● System load, network load, log file error message flags.
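   As a rough sanity check on the first two candidates (with N = 2 gateways at the start of the test):

       40\ \mathrm{Gb/s} = 5\ \mathrm{GB/s}\ \text{(site link)}, \qquad 2 \times 25\ \mathrm{Gb/s} = 50\ \mathrm{Gb/s} = 6.25\ \mathrm{GB/s}\ \text{(gateway NICs)}

   On paper neither should bind much below ~5 GB/s, so sustained rates well under that would point at the storage or the gateways themselves.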
9. Method
   ● Working with Duncan Rand, James Walder (from RAL, the UK ATLAS rep) subscribed 100TB of data (4 datasets of ~25TB each) located at CERN to Lancaster.
     ○ This had to be approved by the ATLAS DDM team.
     ○ The datasets purposely consisted of a mix of large and less-large files (3 datasets had ~9GB files, 1 had ~4GB files).
     ○ Approximately 9k of the 9GB files, and 6k of the 4GB files.
     ○ The transfer was predicted to take a few days, allowing us time to react, adjust and review.
   ● When the transfers were ready, the number of simultaneous FTS transfers between CERN and Lancaster was increased from 100 to 200 max transfers (i.e. concurrent active jobs).
     ○ This encouraged FTS to throw as much data as it could our way.
   ● Our plan was to then sit back and watch the FTS plots, keeping each other informed of our observations.
   …I'm not entirely sure that's enough to qualify as a plan!
10. The Plots (test period ~8am 14th Sept to ~7pm 15th Sept)
   https://monit-grafana.cern.ch/d/veRQSWBGz/fts-servers-dashboard?orgId=25&var-group_by=vo&var-vo=atlas&var-src_country=All&var-dst_country=All&var-src_site=CERN-PROD&var-source_se=All&var-dst_site=UKI-NORTHGRID-LANCS-HEP&var-dest_se=All&var-fts_server=All&var-staging=All&var-bin=1h&var-dest_hostname=All&var-source_hostname=All&from=1662948155011&to=1663434668644
   The maximum data rate over the first day was 1.07 GB/s (~8.6 Gb/s), with an average of 0.75 GB/s. (We're also aware that our transfer efficiency is never very good, even outside of the test period.)
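   For context, putting those rates against the 40 Gb/s (5 GB/s) site link:

       \frac{1.07\ \mathrm{GB/s}}{5\ \mathrm{GB/s}} \approx 21\%\ \text{(peak)}, \qquad \frac{0.75\ \mathrm{GB/s}}{5\ \mathrm{GB/s}} = 15\%\ \text{(average)}

   i.e. the upgraded link was nowhere near saturated, so the limit had to be elsewhere.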
11. What were we seeing?
   ● The FTS was regularly throttling back transfers as it hit the efficiency threshold.
     ○ FTS decides whether to ramp the number of concurrent transfers up or down based on the efficiency (i.e. the percentage of failed transfers). There are other factors, but this was the one we hit.
     ○ Sadly the decision level isn't indicated in these plots, but it hovers at around the 90% level.
   ● Transfers were failing due to "checksum timeout" errors.
     ○ Referring to the post-transfer checksum of the written files as they are on the filesystem, which is a computation- and IOP-heavy task.
     ○ This is not the first time we have seen this error (having hit it during first commissioning of the service), but we thought we had tuned it out through tweaking server settings.
   ● The xrootd servers themselves were heavily loaded, with corresponding errors in the logs.
   Solution: try to increase the checksumming capacity of our servers.
12. Mid-test Tuning (First Day) (1)
   ● With the "checksumming capacity" being the main limiting factor, we tried to tune (i.e. raise) the relevant xrootd settings (illustrative snippet after this slide):
     ○ xrootd.chksum max XX (the number of simultaneous checksum operations)
     ○ ofs.cksrdsz XXXXm (AIUI the size, in MB, of each file chunk stored in RAM for the operation)
       ■ The total memory consumed by checksumming operations = Nchksums * Size
   ● These have had to be raised before, in "normal" operations, due to the quite low defaults.
   ● There is also the checksumming digest used - the in-built xrootd adler32 in our case.
     ○ There is the possibility of replacing this with a custom adler32 script, but this was a bigger change than we could make in a short space of time.
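   For illustration, the directives look something like this in an xrootd config file (the values are placeholders, not the settings we ended up with):

       xrootd.chksum max 48 adler32   # up to 48 simultaneous adler32 checksum operations
       ofs.cksrdsz 512m               # checksum files in 512MB chunks held in RAM
       # worst-case checksumming memory ~= 48 * 512MB = 24GB (Nchksums * Size, as above)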
13. Mid-test Tuning (First Day) (2)
   ● However, we took things too far just cranking up the number of concurrent checksums and the amount of RAM devoted to them, and started getting out-of-memory errors on our xrootd servers. Hence we had to throttle back.
     ○ The "sweet spot" appears to be (worked example after this slide):
       ■ maxchksums * cksrdsz < MachineRAM/2 (i.e. less than half the machine RAM for checksums)
       ■ maxchksums ≈ nMachineThreads
   ● Even after increasing these numbers as high as they could go without affecting performance, we did not see much of an improvement in the failure rates.
     ○ Essentially they were about as "tuned" as they could be.
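   As a rough worked example, applying that rule of thumb to xroot server 1 (48 threads, 128GB RAM):

       \text{maxchksums} \approx 48, \qquad 48 \times \text{cksrdsz} < \tfrac{128\ \mathrm{GB}}{2} = 64\ \mathrm{GB} \;\Rightarrow\; \text{cksrdsz} \lesssim 1.3\ \mathrm{GB}

   The 96GB data server tolerates correspondingly less (cksrdsz under ~1GB), presumably part of why its RAM "became a factor".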
14. Throwing more servers at the problem (12.00, 2nd day)
   We had one drained DPM disk server lying around, originally earmarked for a CEPH testbed. So we quickly reinstalled it and threw it into production as a third xrootd server.
   The results were almost immediate and quite pronounced (although somewhat expected) - the maximum rate jumped to 1.58 GB/s (average 1.09 GB/s).
15. Conclusions 1
   The bottleneck for our site appears to be the lack of CPU to process checksums in a timely manner.
   ● A back-of-an-envelope guesstimate suggests that to take in data at a rate of ~40Gb/s we would need approximately 6-8 medium-quality servers (~24 cores, 128GB of RAM, 25Gb NICs) - rough arithmetic after this slide.
     ○ This is a lot more than the 3-4 xrootd servers we originally planned (and those were for redundancy, not capacity).
     ○ There is a concern that at higher numbers of frontend servers the load on the CEPHFS system would be non-negligible.
       ■ Data is written, then re-read for the checksumming process.
       ■ Luckily for us we have a few in-warranty servers from our previous storage that can eventually fill this gap.
   ● Improvements could possibly be made if we could write a plugin to have xrootd checksum "over the wire".
     ○ This would also have the benefit of reducing the load our xroot gateways place on the underlying filesystem.
     ○ However "over the wire" checksums rely on the data being written sequentially, which is not guaranteed for POSIX filesystems (or multi-threaded transfers).
   ● Of course there may be other config changes that improve the picture, but we believe these will be small-scale benefits.
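   One way to get a figure of that order, assuming each well-tuned gateway sustains roughly 0.6-0.8 GB/s (a little above the ~0.5 GB/s per gateway peak seen during the test):

       40\ \mathrm{Gb/s} = 5\ \mathrm{GB/s}, \qquad \frac{5\ \mathrm{GB/s}}{0.6\text{--}0.8\ \mathrm{GB/s\ per\ gateway}} \approx 6\text{--}8\ \text{gateways}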
16. Conclusions 2
   The test was incredibly useful, and revealed problems that would otherwise only have shown themselves at the height of production or in some large-scale data challenge. We plan to have a repeat performance after we've got a few more redirectors in place and polished our setup a bit more.
   ● But future tests don't need to be of the same scale - 100TB was almost certainly more than we needed.
   ● With coordination of the start time (so data starts moving during office hours) we reckon ~10TB would be more than enough to provide useful statistics.
   ● For the next test it would be advantageous to keep a "log" of changes and restarts.
     ○ Service restarts noticeably muddy the transfer results, even if they're brief.
   ● Also I will take more screenshots - some data has a short lifespan (so I missed capturing it for this presentation).
   ● Of course we don't want to just fill up CERN FTS traffic with test transfers…
17. Monitoring Improvements
   Another positive outcome of the exercise is that it helped us polish our monitoring.
   Plots made using Prometheus + Grafana.
18. Summary
   ● Much like how you can't beat "real" job flows for testing compute, there is no substitute for real data flows for testing your storage.
     ○ Just crank things up to 11.
   ● The ATLAS data management system can easily provide this.
     ○ But please liaise with DDM and your local ATLAS reps first!
   ● For us, the bottleneck was not our network bandwidth or our filesystem performance, but our gateway CPU.
     ○ Not quite what we expected.
     ○ More gateway servers would help, but "over the wire" checksumming could help more.
19. Moral of the Story
   Big pipes don't guarantee good data flow.
   OR
   It's not always the network's fault.
   OR
   Don't skimp on your endpoint hardware.
20. References
   ● Lancaster xroot config on GitHub: https://github.com/mdoidge/lancsxroot
   ● Monit FTS dashboard (may not be available to all).
   ● FTS documentation: https://fts.web.cern.ch/fts/
   ● XRootD documentation: https://xrootd.slac.stanford.edu/docs.html
     ○ In particular: https://xrootd.slac.stanford.edu/doc/dev54/xrd_config.htm#_Toc88513998
