Id0115
  • 1. The SAM-Grid Fabric Services
    • Gabriele Garzoglio (for the SAM-Grid team)
    • Computing Division, Fermilab
  • 2. Overview
    • Introduction
    • The grid-level services: an overview
      • Job Management
    • The fabric-level services
      • Local batch system adaptation
      • Dynamic product retrieval
      • Local sandbox management
      • Job complex-status logging
  • 3. Introduction
    • SAM is a Data Handling System for HEP: the project was started in 1997 by DZero
    • SAM-Grid project started in 2001-2002 to handle DZero’s expanded needs for globally distributed computing
    • CDF joined SAM-Grid at the end of 2002
    • JIM complements the data handling system (SAM) with Job and Info Management: SAM-Grid = JIM + SAM
    • JIM is funded by PPDG and GridPP
    • Participated in SC02 and SC03 (the Supercomputing conferences)
  • 4. Overview: The grid-level services (Job Management)
  • 5. Job Management (architecture diagram; components shown: the JOB, User Interface, Submission Client, a Broker (Match Making Service, Information Collector), and Execution Sites #1 … #n, each with a Computing Element, Queuing System, Grid Sensors, Storage Elements and a Data Handling System)
  • 6. Overview: The fabric-level services
  • 7. Running jobs on Grid resources: the trend
    • Grid resources are not dedicated to a single experiment
    • Translation:
      • no daemons running on the worker nodes of a Batch System
      • no experiment-specific software installed
  • 8. Running jobs on Grid resources: today
    • The situation is transitioning:
      • Generally, experiments can install specific services on a node close to the cluster.
      • Worker nodes typically access the software via a shared file system (FS): not scalable!
      • Local resource configurations are still too diverse to plug easily into the Grid
    • Today, most of our efforts are directed at coping with (the lack of) standard local fabric services
  • 9. Overview: Local batch system adaptation
  • 10. Motivation
    • Problem: “standard” Grid batch system adapters (Globus job managers) are too restrictive to fit every local configuration
    • Examples:
      • the terms of the agreement for using the batch system may require special directives to the batch system, which the standard adapters cannot add
      • system administrators end up writing wrappers around the standard batch system commands
  • 11. SAM Batch System Adapter
    • We factor out the local batch system configuration using an intermediate layer that abstracts the basic interactions with the batch system
      • submit command
      • lookup command
      • remove command
    • For each of the commands above, the administrator can specify how to parse its output to fish out the relevant information, e.g. the local job ID returned on submission
    • We have written JIM Globus job managers that use this layer
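The adapter layer described above can be sketched as a per-site table that maps each interaction (submit, lookup, remove) to a command template plus a regular expression that fishes the relevant value out of the command's output. This is a minimal Python sketch under stated assumptions, not the JIM code: `CommandSpec`, `BatchAdapter`, the `qsub`/`qstat`/`qdel` templates, the `queue` parameter and the output formats are all hypothetical, and a fake runner replaces the real batch commands so the example runs anywhere.

```python
import re
import shlex
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CommandSpec:
    """One site-configured interaction with the batch system (hypothetical structure)."""
    template: str           # command line with {placeholders}, written by the administrator
    pattern: Optional[str]  # regex whose first group is the value to extract, or None


class BatchAdapter:
    """Hypothetical adapter: submit/lookup/remove driven entirely by site configuration."""

    def __init__(self, specs: Dict[str, CommandSpec], run: Callable[[List[str]], str]):
        self.specs = specs
        self.run = run  # executes an argv list and returns its stdout

    def _call(self, action: str, **params: str) -> Optional[str]:
        spec = self.specs[action]
        argv = shlex.split(spec.template.format(**params))
        output = self.run(argv)
        if spec.pattern is None:
            return None
        match = re.search(spec.pattern, output, re.MULTILINE)
        if match is None:
            raise RuntimeError(f"{action}: cannot parse batch output {output!r}")
        return match.group(1)

    def submit(self, script: str, **site_params: str) -> Optional[str]:
        return self._call("submit", script=script, **site_params)  # local job id

    def lookup(self, job_id: str) -> Optional[str]:
        return self._call("lookup", job_id=job_id)  # batch-level status

    def remove(self, job_id: str) -> None:
        self._call("remove", job_id=job_id)


# Hypothetical site configuration and a fake runner standing in for the real commands.
site_specs = {
    "submit": CommandSpec("qsub -q {queue} {script}", r"job (\d+)"),
    "lookup": CommandSpec("qstat {job_id}", r"^\d+\s+(\w+)"),
    "remove": CommandSpec("qdel {job_id}", None),
}

def fake_run(argv: List[str]) -> str:
    outputs = {"qsub": 'Your job 4711 ("job.sh") has been submitted\n', "qstat": "4711 R\n"}
    return outputs.get(argv[0], "")

adapter = BatchAdapter(site_specs, fake_run)
job_id = adapter.submit("job.sh", queue="dzero")
print(job_id, adapter.lookup(job_id))  # 4711 R
```

Because the templates and patterns are plain data, an administrator could add local directives or point `submit` at a site wrapper script without modifying the job manager.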
  • 12. Overview: Dynamic product retrieval
  • 13. Motivation
    • Portability of the DZero and CDF software is still not completely solved.
    • Most CDF and DZero applications still rely on the offline software being preinstalled at the site.
    • Administrators need to install and maintain the software at each site
    • A job submitted to the grid must be able to execute at a site where its dependencies are installed
  • 14. Old solution: software advertisement
    • Administrators install the software at each site
    • The JIM advertisement framework senses the new product and advertises it to the broker as one of the characteristics of the site
    • Drawbacks:
      • the administrators still need to install the software
      • increased complexity of the advertisement framework: it needs to know how to detect the list of installed products
      • increased complexity of the broker: it needs to enforce the matching to the eligible sites
      • jobs that need old software versions may not find an eligible site
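To see why the old scheme burdens the broker, consider a toy matchmaker in which every site advertises the (product, version) pairs installed there and a job is eligible only where all of its dependencies are advertised. The function, site names and product names are hypothetical illustrations, not the JIM advertisement schema.

```python
from typing import Dict, List, Set, Tuple

Product = Tuple[str, str]  # (product name, version)


def eligible_sites(adverts: Dict[str, Set[Product]], needs: Set[Product]) -> List[str]:
    """Sites whose advertised installed products include every required (name, version)."""
    return sorted(site for site, installed in adverts.items() if needs <= installed)


# Hypothetical advertisements gathered by the information system from each site.
adverts = {
    "site-a": {("reco", "v1"), ("root", "v3")},
    "site-b": {("reco", "v2"), ("root", "v4")},
    "site-c": {("reco", "v2"), ("root", "v3")},
}

print(eligible_sites(adverts, {("reco", "v2")}))                  # ['site-b', 'site-c']
print(eligible_sites(adverts, {("reco", "v2"), ("root", "v4")}))  # ['site-b']
print(eligible_sites(adverts, {("reco", "v0")}))                  # [] (no site kept the old version)
```

Each new release has to be installed by hand, detected by the advertisement framework and matched by the broker; a job tied to a version that no site still carries, like `("reco", "v0")` here, has nowhere to run.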
  • 15. New solution: dynamic software retrieval
    • Product developers store the software in SAM with appropriate metadata
    • Before running a job at a site, the infrastructure asks SAM to deliver the products the job depends on
    • The products live in the SAM cache and are automatically managed
    • Drawbacks:
      • increased complexity of local job submission
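With dynamic retrieval, the matching problem becomes a cache problem at the execution site: dependencies are delivered on demand and old products are evicted automatically. The sketch below approximates that cache with a size-bounded least-recently-used (LRU) policy; the LRU choice, the size accounting and the `fetch` callback (a hypothetical stand-in for the delivery request to SAM) are assumptions, not a description of the SAM cache's actual policy.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List, Tuple

Product = Tuple[str, str]  # (product name, version)


class ProductCache:
    """Size-bounded LRU cache standing in for the SAM cache at an execution site."""

    def __init__(self, capacity_mb: int, fetch: Callable[[Product], int]):
        self.capacity_mb = capacity_mb
        self.fetch = fetch  # hypothetical delivery request; returns the product size in MB
        self.entries: "OrderedDict[Product, int]" = OrderedDict()  # least recently used first

    def used_mb(self) -> int:
        return sum(self.entries.values())

    def ensure(self, product: Product, pinned: Iterable[Product] = ()) -> None:
        """Make `product` available locally, never evicting anything in `pinned`."""
        if product in self.entries:
            self.entries.move_to_end(product)  # cache hit: mark as most recently used
            return
        size = self.fetch(product)
        keep = set(pinned)
        while self.used_mb() + size > self.capacity_mb:
            victim = next((p for p in self.entries if p not in keep), None)
            if victim is None:
                break  # everything left is needed by this job: overcommit instead of failing
            del self.entries[victim]
        self.entries[product] = size

    def prepare_job(self, dependencies: List[Product]) -> None:
        """Run before the job starts: deliver every dependency, protecting the job's own products."""
        for product in dependencies:
            self.ensure(product, pinned=dependencies)


delivered: List[Product] = []

def fake_fetch(product: Product) -> int:
    delivered.append(product)
    return 40  # pretend every product is 40 MB

cache = ProductCache(capacity_mb=100, fetch=fake_fetch)
cache.prepare_job([("reco", "v2"), ("root", "v4")])  # two deliveries, 80 MB used
cache.prepare_job([("reco", "v2")])                  # cache hit, no delivery
cache.prepare_job([("mc", "v1")])                    # evicts the least recently used: root v4
print(list(cache.entries))  # [('reco', 'v2'), ('mc', 'v1')]
```

Pinning the job's own dependencies prevents one dependency from evicting another that the same job still needs.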
  • 16. Overview: Local sandbox management
  • 17. Nomenclature
    • Input sandbox:
      • from the client (user sandbox):
        • the executable
        • configuration files
        • special dependencies (libraries, products,…)
      • from the local site
        • the product dependencies
    • Output sandbox:
      • stdout, stderr
      • log files
      • small custom output (e.g. histograms)
  • 18. Requirements
    • We want an infrastructure that:
      • stores the user sandbox received from the Grid locally at the site
      • transports and installs the input sandbox to the worker node
      • packages the output and hands it over to the Grid
  • 19. Limitations to overcome
    • the file transport mechanism of a batch system is site-specific and needs to be factored out
    • shared file systems have scalability limits: we want to rely on them as little as possible
    • the worker nodes may have connectivity restrictions (firewalls)
  • 20. The sandbox management 1
    • It creates a sandbox area (reorganizing the native Globus GASS cache)
    • It starts up a gridftp server for communication between the worker nodes and the head node (no shared FS)
    • It requests the delivery of the product dependencies
    • It creates a self-extracting archive that contains the gridftp client and a bootstrapping script; when run, the archive transfers and installs the product dependencies, then passes control to the application
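A self-extracting job of this kind can be built as a shell script whose tail is a base64-encoded tarball: the header unpacks the payload into a scratch directory on the worker node and then `exec`s a bootstrap script. This Python sketch only assembles such a script; the header layout, the `__PAYLOAD__` marker, the `bootstrap.sh`/`app.sh` names, and the availability of `mktemp`, `base64 -d` and `tar -z` on the worker node are assumptions, and the gridftp transfer is left as a placeholder.

```python
import base64
import io
import tarfile
from typing import Dict

# Shell header of the self-extracting job (hypothetical layout). It unpacks the payload
# into a fresh scratch directory on the worker node and hands control to bootstrap.sh.
HEADER = """#!/bin/sh
set -e
WORKDIR=$(mktemp -d)
sed '1,/^__PAYLOAD__$/d' "$0" | base64 -d | tar -xzf - -C "$WORKDIR"
cd "$WORKDIR"
exec sh ./bootstrap.sh "$@"
__PAYLOAD__
"""


def build_self_extracting(files: Dict[str, bytes]) -> str:
    """Pack files (archive name -> content) into a tar.gz and append it, base64-encoded, to HEADER."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mode = 0o755  # keep the scripts executable after extraction
            tar.addfile(info, io.BytesIO(data))
    return HEADER + base64.encodebytes(buf.getvalue()).decode("ascii")


# Hypothetical contents: in SAM-Grid the archive would also carry the gridftp client,
# and bootstrap.sh would use it to fetch and install the product dependencies first.
script = build_self_extracting({
    "bootstrap.sh": b'#!/bin/sh\necho "fetch products here"\nexec ./app.sh "$@"\n',
    "app.sh": b"#!/bin/sh\necho hello from the worker node\n",
})
print(script.splitlines()[0])  # #!/bin/sh
```

Because the `exec` replaces the shell before it reaches the marker, the payload is never interpreted as shell code; each parallel instance submitted to the batch system is just another copy of this one file.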
  • 21. The sandbox management 2
    • It submits parallel instances of the self-extracting archive to the batch system
    • The job relies on SAM for transfers of large input/output files
    • When the job finishes, stdout/stderr and custom output are packaged at the head node and transported back to the submission site via Grid mechanisms
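Packaging the output sandbox at the head node might look like the following: stdout, stderr and small custom outputs are bundled into one archive for the Grid to carry back, while anything large is left to SAM. The file names `stdout`/`stderr`, the glob patterns, the `package_output` function and its 10 MB default limit are hypothetical choices for illustration.

```python
import fnmatch
import io
import os
import tarfile
import tempfile
from typing import List


def package_output(workdir: str, patterns: List[str], max_bytes: int = 10_000_000) -> bytes:
    """Bundle stdout/stderr plus small files matching `patterns` into an in-memory tar.gz.

    Files larger than max_bytes are left out: large outputs are meant to go through SAM.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in sorted(os.listdir(workdir)):
            path = os.path.join(workdir, name)
            if not os.path.isfile(path):
                continue
            wanted = name in ("stdout", "stderr") or any(fnmatch.fnmatch(name, p) for p in patterns)
            if wanted and os.path.getsize(path) <= max_bytes:
                tar.add(path, arcname=name)
    return buf.getvalue()


# Simulated job directory with a small histogram file and a large raw file.
with tempfile.TemporaryDirectory() as workdir:
    for name, data in [("stdout", b"done\n"), ("stderr", b""),
                       ("hist.root", b"\0" * 100), ("events.raw", b"\0" * 4096)]:
        with open(os.path.join(workdir, name), "wb") as f:
            f.write(data)
    blob = package_output(workdir, ["*.root", "*.raw"], max_bytes=1024)

with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
    names = tar.getnames()
print(names)  # ['hist.root', 'stderr', 'stdout']
```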
  • 22. Open problems
    • Not all batch systems allow selecting a node with enough scratch space to install the needed software
    • This infrastructure would be much simpler if there were a “standard” local storage service at every site (e.g. DiskFarm)
  • 23. Overview: Job complex-status logging
  • 24. Motivation
    • Distributed logging of job status/history
    • Web monitoring
    • Statistics on historical data
    • Grid scheduling based upon job status/history at a certain site
  • 25. The XML DB Status Logger
    • The status of the job is reported to an XML database deployed at each execution site
    • The information comes from the local batch system (simple job status e.g. “idle”, “running”, …) AND from the application (complex status e.g. “Processing executable X in the chain”)
    • The XML database gives flexible remote access via standard mechanisms, such as XPath
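Combining batch-level and application-level status in one XML document can be sketched with Python's standard `xml.etree.ElementTree`, which understands a limited subset of XPath (attribute predicates and `last()`, among others). The `<site>`/`<job>`/`<status>` layout, its attributes and the `log_status` helper are hypothetical, and an in-memory tree stands in for the XML database deployed at each site.

```python
import xml.etree.ElementTree as ET


def log_status(db: ET.Element, job_id: str, batch: str, app: str = "") -> None:
    """Append one status record for job_id, creating the <job> element on first use."""
    job = db.find(f"job[@id='{job_id}']")
    if job is None:
        job = ET.SubElement(db, "job", id=job_id)
    # batch: simple status from the local batch system; app: complex status from the application
    ET.SubElement(job, "status", batch=batch, app=app)


db = ET.Element("site", name="site-a")  # in-memory stand-in for the site's XML database
log_status(db, "4711", "idle")
log_status(db, "4711", "running", "Processing executable 1 of 3 in the chain")
log_status(db, "4712", "running", "Processing executable 2 of 3 in the chain")

# History of one job, and the latest application-level status of every job, via XPath.
history = [s.get("batch") for s in db.findall("job[@id='4711']/status")]
latest = {job.get("id"): job.find("status[last()]").get("app") for job in db.findall("job")}
print(history)         # ['idle', 'running']
print(latest["4712"])  # Processing executable 2 of 3 in the chain
```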
  • 26. Conclusions
    • The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
    • The SAM-Grid adopts Fabric-level configurable solutions for batch system adaptation, product delivery, sandboxing and job complex-status logging
    • The community needs to come up with standard fabric-level services to make any Grid usable
  • 27. More info at…
    • http://www-d0.fnal.gov/computing/grid/
    • http://samgrid.fnal.gov:8080/
    • Morag Burgon-Lyon’s Talk on SAM-Grid for CDF!