CERN has provided open access to a vast amount of data. Links to these datasets can break over time. In these slides, I present how I checked that there were no broken links to the CMS Experiment datasets.
11. The CERN Open Data Portal contains datasets of real data recorded with the CMS detector
13. The datasets contain information about individual events and collections of physics objects
15. The information is stored in the aforementioned AOD files, which are conveniently listed in index files
18. PROBLEM
Checking the CMS datasets 09/16
That's the problem!
With such a large list of files,
deletion, corruption or loss of data could occur
How do we make sure that is not happening now?
19. PROPOSED SOLUTION
GFAL (Grid File Access Library) version 2
provides useful command line tools…
gfal-ls is equivalent to the system ls command
Supports the XRootD protocol (root://)
-l option allows long listing format (including size)
20. PROPOSED SOLUTION
On the other hand…
JSON (JavaScript Object Notation) version of the index
files provides more organized and detailed information
about the ROOT files of each dataset
… just do the replacement .txt → .json
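The replacement above can be sketched in Python. The helper name `json_index_name` is hypothetical; the only thing taken from the slides is the `.txt` → `.json` suffix swap.

```python
def json_index_name(txt_index):
    """Return the JSON index name for a .txt index file name or URL."""
    if not txt_index.endswith(".txt"):
        raise ValueError(f"not a .txt index file: {txt_index}")
    # Replace only the trailing suffix, not any ".txt" earlier in the path.
    return txt_index[: -len(".txt")] + ".json"
```

Slicing off the suffix, rather than calling `str.replace`, avoids touching a `.txt` that happens to appear in a directory name.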
22. There we can find the size the ROOT file is expected to have if it has not been modified
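As a minimal sketch of reading the expected sizes, assume each JSON index is a list of records with a `uri` field and a `size` field in bytes. Those field names and the function name `expected_sizes` are assumptions; the slides do not show the JSON layout.

```python
import json

def expected_sizes(index_text):
    """Map each file URI in a JSON index to its expected size in bytes."""
    records = json.loads(index_text)
    # Assumed record layout: {"uri": "root://...", "size": 12345, ...}
    return {record["uri"]: int(record["size"]) for record in records}
```

Keying the result by URI makes the later lookup for a given ROOT file a single dictionary access.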
23. A SMALL EXAMPLE
gfal-ls will tell us if the file is still available…
$ gfal-ls root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Electron/PATtuples/Electron_PAT_data_500files_1.root
Possible results:
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Electron/PATtuples/Electron_PAT_data_500files_1.root
It exists (expected result)
gfal-ls error: 2 (No such file or directory) - Failed to stat file (No such file or directory)
It has not been found (Houston, we have a problem)
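The availability check could be automated along these lines. This is a sketch that assumes `gfal-ls` is installed and exits with a non-zero status when the file is missing, as the "error: 2" message suggests; the function name `file_exists` and the injectable `run` parameter are illustrative, not part of GFAL.

```python
import subprocess

def file_exists(url, run=subprocess.run):
    """Return True if `gfal-ls URL` exits with status 0."""
    # `run` is injectable so the logic can be exercised without gfal2 installed.
    result = run(["gfal-ls", url], capture_output=True, text=True)
    # Assumption: a missing file makes gfal-ls exit with a non-zero status.
    return result.returncode == 0
```

Calling `file_exists(url)` with no second argument runs the real command; a stub can be passed as `run` when testing offline.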
24. A SMALL EXAMPLE
If the file exists, we must check its size with gfal-ls -l and compare it with the size given in the JSON file
$ gfal-ls -l root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Electron/PATtuples/Electron_PAT_data_500files_1.root
Result
-r-------- 1 1399 125433 12117591860 Sep 4 2014 root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Electron/PATtuples/Electron_PAT_data_500files_1.root
Do you remember the JSON file?
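The size comparison can be sketched as follows, assuming the long listing always has the layout shown in the result above, with the size as the fifth whitespace-separated field. The function names are hypothetical, and the expected value would come from the JSON index.

```python
def listed_size(listing_line):
    """Extract the size in bytes from one line of `gfal-ls -l` output."""
    # Assumed fields: permissions, link count, uid, gid, size, date..., name
    fields = listing_line.split()
    return int(fields[4])

def size_matches(listing_line, expected_size):
    """Return True if the listed size equals the size expected from the index."""
    return listed_size(listing_line) == expected_size
```

A mismatch here would suggest the file was truncated or altered after the index was written.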
26. RESULTS
Files with broken links
Index files with reading problems