glideinWMS validation scripts - glideinWMS Training Jan 2012
Description of how to write custom validation scripts in glideinWMS, with an emphasis on the VO Frontend operations.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.

Published in: Technology

Transcript

  • 1. glideinWMS Training @ UCSD: GlideinWMS Validation scripts, by Igor Sfiligoi (UCSD). UCSD, Jan 18th 2012.
  • 2. Overview
      ● Why validation scripts
      ● Anatomy of validation scripts
      ● Types of validation scripts
  • 3. Reminder - Glideins
      ● A glidein is just a properly configured Condor execution node submitted as a Grid job
      ● glideinWMS provides automation
      [Diagram labels: Central manager (Collector, Negotiator); Submit nodes (Schedd); glidein Execution nodes (Startd); CREAM; Globus; Job; glideinWMS]
  • 4. Reminder – Glidein script
      ● The glidein startup script is just an empty shell that:
          ● Downloads scripts, parameters and Condor bins
          ● Runs the scripts in order
          ● Does the final cleanup
      ● Two types of script:
          ● Node validation
          ● Condor configuration and startup
      ● If any of these fail, Condor will never be started
      ● Once Condor starts, glideinWMS is out of the way
  • 5. As a consequence
      ● If a validation script finds a bad WN (worker node), Condor will not be started
      ● No user jobs will ever fail here
  • 6. Is validating at glidein startup a good idea?
      ● Advantages:
          ● User jobs never land on “broken” nodes (users happy)
          ● Failures are logged: Factory admins can act on this info, notifying sites (who can fix the problem)
      ● Limitations:
          ● Tested only at glidein startup
              – If a node “goes bad” after Condor startup, user jobs will still be fetched and will fail
              – Condor provides cron-like capabilities for this
      ● Problems:
          ● Failed validation → wasted CPU
              – Some jobs may still succeed, even if validation failed
              – Can be solved by passing the test and setting attributes, but this will hide the problem from the Factory
  • 7. Anatomy of a validation script
  • 8. Validation scripts 101
      ● Any executable will do!
          ● There are no restrictions
          ● Can be a compiled binary or a shell script
      ● Exit code checked
          ● ==0 - Success
          ● !=0 - Failure
      ● And, to a first approximation, this is all
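
The exit-code convention above can be illustrated with a very small check. This is only a sketch: the function name check_writable_dir and the choice of testing a scratch directory are assumptions made for illustration, not something glideinWMS prescribes.

```shell
# Hypothetical check: is the given directory present and writable?
# Returns 0 on success and 1 on failure, mirroring the exit-code convention.
check_writable_dir() {
    local dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        return 0
    fi
    # Explain the failure on stderr before reporting it
    echo "Directory '$dir' is missing or not writable" 1>&2
    return 1
}

# A validation script built on this function would typically end with:
#   check_writable_dir "${TMPDIR:-/tmp}" || exit 1
#   exit 0
```

Keeping the check in a function that returns a status, and calling exit only at the end of the script, makes the check easy to try out by hand before it is deployed.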
  • 9. Validation scripts - I/O
      ● You may want to:
          ● Get some input
          ● Have some output
      ● Both are handled through a dashboard file
          ● Filename passed as the only argument to the validation scripts
  • 10. Dashboard file
      ● Simple list of (key, value) pairs
          ● One per line (newline not allowed in either key or value)
          ● Space separated (space not allowed in the key)
          ● Hash (#) can be used for comments
      ● Example:
            GLIDEIN_Factory UCSD
            GLIDEIN_Name Production_v4_2
            GLIDEIN_Entry_Name CMS_T2_US_UCSD_gw2
            GLIDECLIENT_Name UCSD-v5_3.main
            GLIDEIN_WORK_DIR /data10/condor_local/execute/dir_22668/glide_B22745/main
            GLIDEIN_Glexec_Use OPTIONAL
            X509_CERT_DIR /wn-client/globus/TRUSTED_CA
            GLIDEIN_Site UCSD
            # This was calculated on the fly
            CCB_ADDRESS glidein-collector.t2.ucsd.edu:9822
      ● http://tinyurl.com/glideinWMS/doc.prd/factory/custom_scripts.html#glidein_config
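
Given the format rules above (one pair per line, no space in the key, "#" for comments), a lookup that also copes with spaces in the value could be sketched as follows. The function name get_config_value is hypothetical, and returning the last match when a key is repeated is an arbitrary choice of this sketch, not documented glideinWMS behavior.

```shell
# get_config_value FILE KEY
# Prints the value stored for KEY and returns 0, or returns 1 if KEY is absent.
# Comment lines are skipped; everything after the first space is the value.
get_config_value() {
    awk -v key="$2" '
        /^#/ { next }                    # ignore comment lines
        $1 == key {
            val = $0
            sub(/^[^ ]+ ?/, "", val)     # drop the key and one separating space
            found = 1
        }
        END { if (found) print val; exit !found }
    ' "$1"
}
```

With the example file above, `get_config_value "$glidein_config" GLIDEIN_Site` would print `UCSD`.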
  • 11. Reading input
      ● Dashboard file as the first argument
      ● Then just look for the key and split on space
            # here is my dashboard file
            glidein_config=$1
            # I expect only one key and no space in the value
            glexec_bin=`awk '/^GLEXEC_BIN /{print $2}' "$glidein_config"`
            if [ -z "$glexec_bin" ]; then
              exit 1
            fi
            …
            exit 0
  • 12. Writing output
      ● You can just append to the file
          ● Just make sure it is properly formatted
            # here is my dashboard file
            glidein_config=$1
            …
            # tell condor to use glexec
            echo GLEXEC_JOB True >> $glidein_config
            exit 0
      ● You should also make sure the same key is not already defined
  • 13. Helper function
      ● glideinWMS provides a helper BASH function to avoid duplicate keys
          ● External SH file, referenced as ADD_CONFIG_LINE_SOURCE
          ● The function name inside is add_config_line
            # here is my dashboard file (MUST be called glidein_config)
            glidein_config=$1
            # get helper function
            add_config_line_source=`awk '/^ADD_CONFIG_LINE_SOURCE /{print $2}' "$glidein_config"`
            source "$add_config_line_source"
            …
            # tell condor to use glexec
            add_config_line GLEXEC_JOB True
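
To make the duplicate-key concern concrete, here is a minimal sketch of what replacing a key, rather than blindly appending it, involves. It is not the real add_config_line: the name set_config_value and its three-argument signature are assumptions made for illustration.

```shell
# set_config_value FILE KEY VALUE
# Rewrites FILE so that KEY appears exactly once, with the new VALUE.
set_config_value() {
    local file="$1" key="$2" value="$3"
    local tmp="$file.tmp.$$"
    # Copy every line except earlier definitions of KEY (comments are kept)
    awk -v key="$key" '$1 != key' "$file" > "$tmp" || return 1
    # Append the new definition, then swap the file into place
    printf '%s %s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}
```

In real validation scripts the helper shipped with glideinWMS is the better choice, since it is the function the rest of the glidein startup expects.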
  • 14. Influencing Condor behavior
      ● By default, keys in the dashboard file are ignored by the Condor startup/configuration script
          ● Anything you write into it is just for your consumption (e.g. for other scripts of yours)
      ● A special whitelist file lists the keys that should be passed to Condor
          ● Referenced as CONDOR_VARS_FILE
          ● Helper function available: add_condor_vars_line (again, source ADD_CONFIG_LINE_SOURCE)
  • 15. Condor Vars file
      ● Each line contains a key
      ● Seven fields, space (or tab) separated
          ● Key
          ● Type - I – Integer, S – String, C – Expr.
          ● Default value - “-” for no default
          ● Condor Name - “+” = Key name
          (Slide callout: Useful when others have to define it)
          ● Is it required? - Y|N
          ● Should be exported to ClassAd? - Y|N
          ● Should be exported to job environment? - “-” no, “+” Key name, “@” Condor Name
      ● http://tinyurl.com/glideinWMS/doc.prd/factory/custom_scripts.html#condor_vars
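
The seven fields can be made tangible with a small decoder. This is a sketch under stated assumptions: the function name describe_condor_var is hypothetical, and it assumes the fields are stored unquoted and contain no embedded spaces, which the slide does not confirm.

```shell
# describe_condor_var LINE
# Splits one condor vars line into its seven fields and prints them,
# resolving the "+" and "@" shorthands described above.
describe_condor_var() {
    local key type default condor_name required to_classad to_env extra
    read -r key type default condor_name required to_classad to_env extra <<EOF
$1
EOF
    if [ -z "$to_env" ] || [ -n "$extra" ]; then
        echo "Expected exactly 7 fields: $1" 1>&2
        return 1
    fi
    # "+" as Condor Name means: same as the key
    if [ "$condor_name" = "+" ]; then
        condor_name="$key"
    fi
    # Job environment: "-" no export, "+" key name, "@" Condor name
    case "$to_env" in
        -) to_env="(not exported)" ;;
        +) to_env="$key" ;;
        @) to_env="$condor_name" ;;
    esac
    echo "key=$key type=$type default=$default condor=$condor_name required=$required classad=$to_classad env=$to_env"
}
```

For instance, `describe_condor_var 'GLEXEC_TMP S - + Y Y +'` reports that GLEXEC_TMP keeps its name both in Condor and in the job environment.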
  • 16. Example
            # here is my dashboard file (MUST be called glidein_config)
            glidein_config=$1
            # extract where to find the vars file
            # (MUST be called condor_vars_file)
            condor_vars_file=`awk '/^CONDOR_VARS_FILE /{print $2}' "$glidein_config"`
            # get helper function
            add_config_line_source=`awk '/^ADD_CONFIG_LINE_SOURCE /{print $2}' "$glidein_config"`
            source "$add_config_line_source"
            …
            # This should already have been set
            add_condor_vars_line "GLEXEC_BIN" "C" "-" "GLEXEC" "Y" "N" "-"
            # tell condor to use glexec
            add_config_line GLEXEC_JOB True
            add_condor_vars_line "GLEXEC_JOB" "C" "True" "+" "Y" "Y" "-"
            # tell user where is the TMPDIR
            add_config_line GLEXEC_TMP $TMPDIR
            add_condor_vars_line "GLEXEC_TMP" "S" "-" "+" "Y" "Y" "+"
  • 17. Error messages
      ● Your script found a problem
          ● Now what?
      ● You definitely want to exit with a non-zero exit code
      ● But, please, also print an error message!
          ● With enough information to understand why the script failed
          ● This will allow the Factory admins to act on it
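
One way to keep error messages informative and uniform is a small reporting function. The sketch below is an illustration only: the name fail_validation and the message layout are assumptions, and glideinWMS did not mandate any particular format at the time of this talk.

```shell
# fail_validation SCRIPT_NAME MESSAGE...
# Prints a message to stderr saying which check failed, on which node, and why,
# then returns 1 so the caller can exit with a non-zero code.
fail_validation() {
    local name="$1"
    shift
    echo "ERROR [$name] on $(uname -n): $*" 1>&2
    return 1
}

# Typical use inside a validation script:
#   [ -x "$glexec_bin" ] || { fail_validation check_glexec "glexec not executable at '$glexec_bin'"; exit 1; }
```

Including the node name and the offending path or value is what lets someone reading the glidein logs act on the failure without logging in to the worker node.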
  • 18. Planned improvements (still speculation at this point)
      ● Current error codes and messages are arbitrary
          ● Mostly good enough for manual debugging
          ● But cannot really automatically act on them
      ● Want to add some more structure
          ● Based on OSG Common Output Format proposal
            https://twiki.grid.iu.edu/bin/view/SoftwareTools/CommonTestFormat#Alain_s_proposal_Version_4_evolu
      ● In addition to the exit code, scripts are expected to write a status file
          ● If the file is not present, will assume “Error unknown”
          ● Which will be read and interpreted by the caller and propagated to the Factory
      ● Now we can start thinking about automatically acting on errors!
  • 19. Standardized error reasons (preliminary - still speculation at this point)
      ● To allow for automated feedback, need standardized error reasons
      ● This is what I currently envision:
          ● Config - e.g. Impossible combinations
          ● Corruption - e.g. SHA1 check failed
          ● WN Resource - e.g. Disk full or glexec not found
          ● Network - e.g. Cannot talk to VO Collector
          ● VO Proxy - e.g. Proxy too short
          ● VO Data - e.g. VO SW not installed
  • 20. Examples (preliminary - still speculation at this point)
            <?xml version="1.0"?>
            <OSGTestResult id="glideinWMS.check_disk" version="7.5.4">
              <result>
                <status>OK</status>
                <metric name="diskspace" ts="2012-01-12T15:02:20" uri="local">/tmp/glidein_15432/</metric>
              </result>
              <detail>Enough disk space found.</detail>
            </OSGTestResult>

            <?xml version="1.0"?>
            <OSGTestResult id="glideinWMS.check_proxy" version="7.5.4">
              <result>
                <status>FAILED</status>
                <metric name="failure" ts="..." uri="local">VO Proxy</metric>
                <metric name="proxy" ts="2012-01-12T15:02:21" uri="local">/tmp/glidein_15432/proxy/a.proxy</metric>
              </result>
              <detail>Proxy had less than 12h left.</detail>
            </OSGTestResult>
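
A script could produce a status file of this shape with a few echo statements. Since the format itself was still speculative at the time of the talk, the sketch below is doubly tentative: the function name write_status_file, its arguments, and the output file location are all assumptions, and no XML escaping of the values is attempted.

```shell
# write_status_file FILE TEST_ID STATUS DETAIL [FAILURE_REASON]
# Writes a status file modeled on the preliminary examples above.
# The version string is copied from those examples.
write_status_file() {
    local file="$1" id="$2" status="$3" detail="$4" reason="${5:-}"
    local ts
    ts=$(date -u +%Y-%m-%dT%H:%M:%S)
    {
        echo '<?xml version="1.0"?>'
        echo "<OSGTestResult id=\"$id\" version=\"7.5.4\">"
        echo "  <result>"
        echo "    <status>$status</status>"
        # The failure metric carries one of the standardized error reasons
        if [ -n "$reason" ]; then
            echo "    <metric name=\"failure\" ts=\"$ts\" uri=\"local\">$reason</metric>"
        fi
        echo "  </result>"
        echo "  <detail>$detail</detail>"
        echo "</OSGTestResult>"
    } > "$file"
}
```

A failing proxy check might then call `write_status_file "$status_file" glideinWMS.check_proxy FAILED "Proxy had less than 12h left." "VO Proxy"` before exiting with a non-zero code, where `$status_file` is a hypothetical path agreed with the caller.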
  • 21. Validation script types
  • 22. Why should you use validation scripts (VS)?
      ● Of course:
          ● Check for obviously broken nodes (what we discussed until now)
      ● But also:
          ● To discover and advertise dynamic information
          ● Non-trivial configuration
          ● Site-specific customizations
  • 23. Dynamic information
      ● Some information is dynamic by nature
          ● E.g. location of VO software
      ● You want to discover at run-time where it is located
          ● And fail, if you cannot find it!
      ● Makes life easier for the users
      ● Once discovered, good practice to advertise it
          ● In the ClassAd, the job environment, or both
  • 24. Example
            # check if CMSSW installed locally
            if [ -f "$CMSSW" ]; then
              source "$CMSSW"
              if [ -z "$CMSSW_LIST" -o -z "$CMSSW_LOC" ]; then
                echo "Corrupted CMSSW at $CMSSW!" 1>&2
                exit 1
              fi
            else
              echo "CMSSW not found!" 1>&2
              exit 1
            fi
            # publish to user job env
            add_config_line "CMSSW_LOC" "$CMSSW_LOC"
            add_condor_vars_line "CMSSW_LOC" "S" "-" "+" "Y" "N" "+"
            # publish to Condor
            add_config_line "CMSSW_LIST" "$CMSSW_LIST"
            add_condor_vars_line "CMSSW_LIST" "S" "-" "+" "Y" "Y" "-"
            exit 0
  • 25. Non-trivial configuration (Not really a “validation” script)
      ● You may want to generate some data on the fly
          ● e.g. a random seed
            let s=$RANDOM%123+17
            add_config_line "MY_SEED" "$s"
            add_condor_vars_line "MY_SEED" "I" "-" "+" "Y" "N" "+"
      ● And sometimes it is just inconvenient to specify some values in the frontend XML file
          ● e.g. a long list
            l="1"
            for ((i=2; i<100; i++)); do
              l="$l:$i"
            done
            add_config_line "MY_LIST" "$l"
            add_condor_vars_line "MY_LIST" "S" "-" "+" "Y" "N" "+"
  • 26. Site specific customization
      ● Currently, the frontend XML file does not allow site-specific customizations
          ● Unless you want to have a group per site! (Limiting, since there is only one level of groups)
      ● And there is the option for you to arrange for the Factory to provide it for you (Maintenance will be a mess)
      ● You can code the per-site config into a “validation script”
          ● Still not ideal, but may be better than the alternative
          ● Especially if you can apply a rule with few exceptions
  • 27. Example
            glidein_config=$1
            site=`awk '/^GLIDEIN_CMSSITE /{print $2}' "$glidein_config"`
            country=`echo $site | awk '{print substr($1,8,2)}'`
            if [ "$country" == "US" ]; then
              myvar="OSG"
            elif [ "$country" == "IT" -o "$country" == "FR" ]; then
              myvar="EGI"
            else
              echo "Cannot run in $country" 1>&2
              exit 1
            fi
            add_config_line "MY_VAR" "$myvar"
            add_condor_vars_line "MY_VAR" "S" "-" "+" "Y" "N" "+"
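
The "rule with few exceptions" idea maps naturally onto a case statement, with the exceptions listed before the general rule. This is a sketch only: the function name infrastructure_for_site is hypothetical, the site name CMS_T3_US_Example is made up, and the CMS_T2_US_UCSD-style naming is an assumption inferred from the character offsets used in the example above.

```shell
# infrastructure_for_site SITE
# Prints the grid flavour to use for SITE, or returns 1 if no rule applies.
infrastructure_for_site() {
    case "$1" in
        # Exceptions first: a made-up site that does not follow its country
        CMS_T3_US_Example)        echo "EGI" ;;
        # General rule: decide by country code
        CMS_T?_US_*)              echo "OSG" ;;
        CMS_T?_IT_*|CMS_T?_FR_*)  echo "EGI" ;;
        *)
            echo "No rule for site '$1'" 1>&2
            return 1
            ;;
    esac
}
```

Because case takes the first matching pattern, adding a new exception is a one-line change that cannot accidentally alter the general rule.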
  • 28. Limitations
  • 29. Limits of validation scripts
      ● Whatever is discovered on the WN is
          ● Used by the script for its own testing
          ● At best, propagated to glidein ClassAd or job env
      ● The discovered info cannot be used for Frontend matchmaking purposes!
          ● At best, for Negotiator matchmaking
      ● As a result, you may be requesting glideins that will never run any user jobs (if the condition is common to all WNs)
          ● They either fail validation or do not match
  • 30. What can you do?
      ● How do you notice it?
          ● If validation errors
              – The Factory admins will likely contact you
          ● If Negotiator not matching jobs
              – You will need to discover it yourself
      ● What to do afterwards?
          ● Tune the script (Maybe you were just too aggressive?)
          ● Manually blacklist a site in your frontend XML (Pretty much a hack!)
          ● Or convince the Factory admins to advertise VO specific info (Can be hard to maintain long term)
  • 31. The End
  • 32. Pointers
      ● The official glideinWMS project Web page is http://tinyurl.com/glideinWMS
      ● The glideinWMS development team is reachable at glideinwms-support@fnal.gov
      ● The OSG glidein factory is reachable at osg-gfactory-support@physics.ucsd.edu
  • 33. Acknowledgments
      ● glideinWMS is a CMS-led project developed mostly at FNAL, with contributions from UCSD and ISI
      ● The glideinWMS factory operations at UCSD are sponsored by OSG
      ● The funding comes from NSF, DOE and the UC system