First Steps With Grid Computing
Transcript

  • 2. First Steps with Grid Computing & Oracle Application Server 10g. Venkata Ravipati, Product Manager, Oracle Corporation; Sastry Malladi, CMTS, Oracle Corporation; Jamie Shiers, IT Division, CERN [email_address]. Session id: 40187
  • 3. Agenda
    • Introduction to Grid Computing
    • OracleAS 10g Features
    • CERN Case Study
    • OracleAS 10g Roadmap
    • Q&A
    Introduction to Grid Computing
  • 4. IT Challenges
    • Enterprise IT is highly fragmented, leading to
      • poor utilization, excess capacity, and system inflexibility
    • Adding capacity is complex and labor-intensive
    • Systems are fragmented into inflexible “islands”
    • Expensive server capacity sits underutilized
    • Installing, configuring, and managing application infrastructure is slow and expensive
    • Poorly integrated applications with redundant functionality increase costs and limit business responsiveness
  • 5. Grid Computing Solves IT Problems
    • IT Problem → Grid Solution
      • High cost of adding capacity → Pool modular, low-cost hardware components
      • Islands of inflexible systems → Virtualize system resources
      • Underutilized server capacity → Dynamically allocate workloads and information
      • Hard to configure and manage → Unify management and automate provisioning
      • Poorly integrated applications with redundant functions → Compose applications from reusable services
  • 6. What is Grid Computing?
    • Grid computing is a hardware and software infrastructure that enables
      • Transparent resource sharing across an enterprise (divisions, data centers), covering resource categories such as
        • Computers
        • Storage
        • Databases
        • Application Servers
        • Applications
      • Coordination of resources that are not subject to centralized control
      • Using standard, open, general-purpose protocols and interfaces
      • To deliver nontrivial qualities of service
  • 7. Enterprise Grid Infrastructure Must Be Comprehensive (layers: Management, Middleware, Database, Storage)
  • 8. Agenda
    • Introduction to Grid Computing
    • OracleAS 10g Features
    • CERN Case Study
    • OracleAS 10g Roadmap
    • Q&A
    OracleAS 10g Features
  • 9. Introducing Oracle 10g
    • Complete, integrated grid infrastructure
  • 10. Oracle Application Server 10g (feature overview diagram: Workload Management, Software Provisioning, User Provisioning, Application Availability, Application Development, Application Monitoring)
  • 11. Workload Management
    • IT Problem: Adding and allocating computing capacity is expensive and too slow to adapt to changing business requirements
    • Oracle 10g Solution:
      • Virtualize servers as modular HW resources
      • Virtualize software as reusable run-time services
      • Manage workloads automatically based on pre-defined policies
  • 12. Virtualized Hardware Resources: Add Capacity Quickly and Economically
  • 13. Virtualized Middleware Services: collections of resources and runtime services (HTTP Server, Web Cache, J2EE Server) grouped into logical applications, e.g. an Accounting application group
  • 14. Policy-based Workload Management: a Workload Manager composed of
    • Policy Manager: stores application-specific policies
    • Resource Manager: manages resource availability and status
    • Dispatcher & Scheduler: distribute workloads based on application-specific policies
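To make that division of labour concrete, here is a minimal, purely illustrative sketch of policy-based dispatching in Java. All class and policy names are hypothetical; OracleAS 10g's actual workload manager is internal to the product and not exposed as an API like this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each application has a policy (min/max instances,
// target CPU); the workload manager compares observed load against the
// policy and decides how many instances the application should run.
public class WorkloadManager {

    // Policy Manager role: stores application-specific policies.
    static class Policy {
        final int minInstances, maxInstances;
        final double targetCpu;                       // e.g. 0.70 = 70% utilisation
        Policy(int min, int max, double targetCpu) {
            this.minInstances = min; this.maxInstances = max; this.targetCpu = targetCpu;
        }
    }

    private final Map<String, Policy> policies = new HashMap<>();

    public void register(String app, Policy p) { policies.put(app, p); }

    // Dispatcher/Scheduler role: given observed CPU load and the current
    // instance count, return the instance count the policy calls for.
    public int desiredInstances(String app, double observedCpu, int current) {
        Policy p = policies.get(app);
        int desired = (int) Math.ceil(current * observedCpu / p.targetCpu);
        return Math.max(p.minInstances, Math.min(p.maxInstances, desired));
    }

    public static void main(String[] args) {
        WorkloadManager wm = new WorkloadManager();
        wm.register("WebStore", new Policy(2, 8, 0.70));
        // Unexpected demand: CPU at 95% across 3 instances -> scale out.
        System.out.println(wm.desiredInstances("WebStore", 0.95, 3)); // prints 5
    }
}
```

The point is only the shape of the logic: policies live apart from the dispatcher, which consults them against observed metrics.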
  • 15. Middleware Services
    • HTTP servers
    • Web caches
    • J2EE servers
    • EJB processes
    • Portal services
    • Wireless services
    • Web services
    • Integration services
    • Directory services
    • Authentication services
    • Authorization services
    • Enterprise Reporting services
    • Query Analysis services
  • 16. Metrics-based Workload Reallocation: unexpected demand → shift more capacity to the Web Store
    • Employee Portal: Portal
    • Accounting: Discoverer, Reports
    • Web Store: HTTP Server, J2EE Server
  • 17. Scheduled Workload Reallocation: capacity shifts between General Ledger and Order Entry at the start and end of each quarter
  • 18. Policy-based Edge Caching
    • Virtualized pools of storage enable sharing and transfer of data between nodes
    • Adaptive caching policies flexibly accommodate changing demand
    (diagram: Client → Grid Caches → Virtual HTTP Server)
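As an illustration of one simple adaptive policy, the sketch below builds a size-bounded least-recently-used edge cache on the standard library's LinkedHashMap access-order mode. This is not the Oracle Web Cache implementation, just a minimal example of demand-driven eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative size-bounded LRU cache for an edge node: the entries
// accessed least recently are evicted first, so the cache's contents
// track current demand automatically.
public class EdgeCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public EdgeCache(int capacity) {
        super(16, 0.75f, true);          // true = access-order iteration (LRU)
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;        // evict once capacity is exceeded
    }

    public static void main(String[] args) {
        EdgeCache<String, byte[]> cache = new EdgeCache<>(1000);
        cache.put("/store/catalog.html", new byte[0]);   // cached page body
    }
}
```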
  • 19. Oracle Application Server 10g (feature overview diagram: Software Provisioning, User Provisioning, Application Availability, Application Development, Application Monitoring, Workload Management)
  • 20. Software Provisioning
    • IT Problem: Installing, configuring, upgrading, and patching systems is labor-intensive and too slow to adapt to changing business requirements
    • Oracle 10g Solution:
      • Manage virtualized HW and SW resources as one system
      • Automate installation, configuration, upgrading, and patching processes
  • 21. Software Provisioning
    • Grid Control Repository (GCR) with centralized inventories for installation and configuration
      • Provision servers
      • Provision software
      • Provision users
  • 22. Automated Deployment
    • Install and configure a single server node
    • Register configuration to the Repository
    • Automatically deploy to nodes as they are added to the grid
  • 23. Software Cloning
    • Automated provisioning based on a master node
    • Archive & replicate specific configurations
      • e.g.: Payroll config optimized for Fridays at 4:00pm
    • Context-specific adjustments
      • e.g.: IP address, host name, web listener
    • Steps: 1. Select software and instances to clone; 2. Clone to selected targets; 3. Update configuration inventory in GCR
  • 24. Patch and Update Management
    • Real-time discovery of new patches
    • Automated staging and application of patches
    • Rolling application upgrades
    • Patch history tracking
    • Steps: 1. Patch published; 2. Determine applicability; 3. Apply patch/upgrade; 4. Update patch inventory in GCR
  • 25. Oracle Application Server 10g (feature overview diagram: Software Provisioning, User Provisioning, Application Availability, Application Development, Application Monitoring, Workload Management)
  • 26. User Provisioning
    • IT Problem:
      • It takes too long to register new users
      • Users have too many accounts, passwords, and privileges to manage
      • Developers re-implement authentication for each new application
    • Oracle 10g Solution:
      • Centralized identity management
      • Shared authentication service
  • 27. Single Sign-on Across the Grid (diagram: Client → Accounting, Sales Portal, Support Portal → Directory)
    • Consolidate accounts
    • Simplify management
    • Facilitate re-use
  • 28. User Provisioning
    • Create users once
      • Centrally manage roles, privileges, preferences
    • Support a single password for all applications
    • Delegate administration
      • Locally administered departments, LOBs, etc.
      • User self-service
    • Interoperate with existing security infrastructure
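A minimal sketch of what a shared authentication service means for application code: rather than each application keeping its own account table, every application validates credentials against one central LDAP directory (Oracle Internet Directory plays this role in OracleAS; the host name and DN layout below are hypothetical). The sketch uses only the standard JNDI API.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.InitialDirContext;

// Sketch: authenticate a user by attempting a simple LDAP bind against
// the central directory. A successful bind means valid credentials.
// Directory host and DN layout are placeholder examples.
public class CentralAuth {

    public static boolean authenticate(String uid, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://oid.example.com:389");
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "uid=" + uid + ",cn=users,dc=example,dc=com");
        env.put(Context.SECURITY_CREDENTIALS, password);
        try {
            new InitialDirContext(env).close();   // bind succeeded
            return true;
        } catch (NamingException e) {
            return false;                         // bind rejected or directory unreachable
        }
    }
}
```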
  • 29. Oracle Application Server 10g (feature overview diagram: Software Provisioning, User Provisioning, Application Availability, Application Development, Application Monitoring, Workload Management)
  • 30. Application Availability
    • IT Problem: Ensuring required levels of availability is too expensive
    • Oracle 10g Solution:
      • Modular components provide inexpensive redundancy
      • Coordinated response to system failures ensures application availability
  • 31. Application Availability
    • Transparent Application Failover (TAF)
      • Automatic session migration
    • Fast-Start Fault Recovery™
      • Automatic failure detection and recovery
    • Multi-tier Failover Notification (FaN)
      • Speeds end-to-end application failover time
      • From 15 minutes to <15 seconds
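TAF itself is configured in the database connect descriptor and needs no application code, but the effect of fast failure notification can be illustrated with a hand-rolled equivalent: detect a broken connection and retry, so a surviving node can pick up the work. A hedged sketch in plain JDBC, with placeholder URL and credentials:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of failover from the application's point of view: if the node
// serving the connection dies, reconnect and retry so that a surviving
// node can take over the service. (TAF does this transparently; this is
// only the manual equivalent, for illustration.)
public class FailoverQuery {

    static ResultSet queryWithRetry(String sql, int maxAttempts) throws SQLException {
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Connection c = DriverManager.getConnection(
                        "jdbc:oracle:thin:@//db.example.com:1521/SVC", "app", "secret");
                Statement s = c.createStatement();
                return s.executeQuery(sql);
            } catch (SQLException e) {
                last = e;   // connection broken or refused: another node may answer
            }
        }
        throw last;         // all attempts exhausted
    }
}
```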
  • 32. Transparent Application Failover: resource failure → fail over the service to additional nodes
    • Employee Portal: Portal
    • Accounting: Discoverer, Reports
    • Web Store: HTTP Server, J2EE Server
  • 33. Fast-Start Fault Recovery™: nodes recovered → reinstate automatically
    • Employee Portal: Portal
    • Accounting: Discoverer, Reports
    • Web Store: HTTP Server, J2EE Server
  • 34. Multi-tier Failover Notification (FaN)
    • Overcomes TCP/IP timeout delays associated with cross-tier application failovers:
    • Without FaN: RAC failover < 8 secs*, AS detection 15 mins, total downtime > 15 mins
    • With FaN: RAC failover < 8 secs*, AS detection < 4 secs, total downtime < 12 secs
  • 35. Oracle Application Server 10g (feature overview diagram: Software Provisioning, User Provisioning, Application Availability, Application Development, Application Monitoring, Workload Management)
  • 36. Application Monitoring
    • IT Problem: Insufficient performance data to plan, tune, and manage systems effectively
    • Oracle 10g Solution:
      • Software pre-instrumented to provide status and fine-grained performance data
      • Centralized console analyzes and summarizes grid performance
  • 37. Application Monitoring
    • Monitor virtual application resources
      • e.g.: J2EE containers, HTTP servers, Web caches, firewalls, routers, software components, etc.
    • Root-cause diagnostics
    • Track real-time and historic performance metrics
      • Application availability, business transactions, end-user performance
    • Notifications and alerts
    • Administer service level agreements (SLAs)
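"Pre-instrumented" means the software already exposes status and performance counters that an agent samples on a schedule and ships to the central console. The sketch below illustrates the sampling pattern using the standard JMX OperatingSystemMXBean as a stand-in metric source; a real agent would export many more application-level metrics.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a monitoring agent: sample a built-in metric at a fixed
// interval and keep a rolling history a central console could collect.
public class MetricSampler {

    private final OperatingSystemMXBean os =
            ManagementFactory.getOperatingSystemMXBean();
    private final Deque<Double> history = new ArrayDeque<>();

    public void sample() {
        double load = os.getSystemLoadAverage();          // -1 if unavailable
        history.addLast(load);
        if (history.size() > 720) history.removeFirst();  // ~1h at 5s intervals
    }

    public static void main(String[] args) throws InterruptedException {
        MetricSampler s = new MetricSampler();
        for (int i = 0; i < 3; i++) { s.sample(); Thread.sleep(5000); }
        System.out.println(s.history);
    }
}
```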
  • 38. Repository-based Management
    • Centralized repository-based management provides a unified view of the entire infrastructure
    • Manage all your end-to-end application infrastructure from any device
    (diagram: Grid Control Repository linking clients, routers/switches, firewalls, web sites, portals, app servers, clusters, integration, custom apps, databases, storage systems, and computer hosts)
  • 39. Performance Monitoring
    • Capture real-time and historical performance data
    • Analyze and tune workload policies
    • Answer questions like:
      • “How much time is being spent in just the JDBC part of this application?”
      • “What was the average response time over the past 3, 6, and 9 months?”
  • 40. Policy-based Alerts
    • User-specified targets, metrics, and thresholds
      • e.g.: CPU utilization, user response times, etc.
    • Flexible notification methods
      • e.g.: phone, e-mail, fax, SMS, etc.
    • Self-correction via pre-defined responses
      • e.g.: execute a script to shut down low-priority jobs
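The alerting loop itself is simple: compare each sampled metric against its user-specified threshold and fire the pre-defined response. An illustrative sketch; the threshold and response actions are made-up examples, not product behaviour.

```java
// Sketch of a policy-based alert: compare a sampled metric against a
// user-specified threshold; on breach, notify and run a self-correction
// hook (e.g. an operator-supplied script that sheds low-priority jobs).
public class CpuAlert {
    static final double CPU_THRESHOLD = 0.90;   // user-specified, e.g. 90%

    static void check(double cpuUtilisation) {
        if (cpuUtilisation > CPU_THRESHOLD) {
            notifyAdmin("CPU at " + (int) (cpuUtilisation * 100) + "%");
            shedLowPriorityJobs();              // pre-defined response
        }
    }

    static void notifyAdmin(String msg) { System.err.println("ALERT: " + msg); }
    static void shedLowPriorityJobs() { /* e.g. execute an operator script */ }
}
```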
  • 41. Agenda
    • Introduction to Grid Computing
    • OracleAS 10g Features
    • CERN Case Study
    • OracleAS 10g Roadmap
    • Q&A
  • 42. LHC Computing Grid Project: Oracle-based Production Services for LCG 1
  • 43. Goals
    • To offer production-quality services for LCG 1 to meet the requirements of forthcoming (and current!) data challenges
      • e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC’04
    • To provide distribution kits, scripts and documentation to assist other sites in offering production services
    • To leverage the many years’ experience in running such services at CERN and other institutes
      • Monitoring, backup & recovery, tuning, capacity planning, …
    • To understand experiments’ requirements for how these services should be established and extended, and to clarify current limitations
    • Not targeting small- to medium-scale DB apps that need to be run and administered locally (close to the user)
  • 44. What Services?
    • POOL file catalogue using EDG-RLS (also non-POOL!)
      • LRC + RLI services + client APIs
      • For GUID <-> PFN mappings
    • and EDG-RMC
      • For file-level meta-data: POOL currently stores:
        • filetype (e.g. ROOT file), fully registered, job status
      • Expect also ~10 items from CMS DC04: others?
    • plus (service behind) EDG Replica Manager client tools
    • Need to provide robustness, recovery, scalability, performance, …
    • File catalogue is a critical component of the Grid!
      • Job scheduling, data access, …
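The data model behind these services is compact: the LRC keeps GUID <-> PFN mappings and the RMC keeps file-level attributes per GUID. Below is a minimal in-memory sketch of those two mappings; the real catalogues are Oracle-backed services reached through the EDG client APIs, and the class and method names here are purely illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the catalogue's core mappings: each GUID
// maps to one or more physical file names (replicas), plus a small set
// of file-level metadata attributes (filetype, registration state, ...).
public class ReplicaCatalogue {

    private final Map<String, List<String>> guidToPfns = new HashMap<>();
    private final Map<String, Map<String, String>> metadata = new HashMap<>();

    public void addReplica(String guid, String pfn) {
        guidToPfns.computeIfAbsent(guid, g -> new ArrayList<>()).add(pfn);
    }

    public void setAttribute(String guid, String key, String value) {
        metadata.computeIfAbsent(guid, g -> new HashMap<>()).put(key, value);
    }

    public List<String> lookup(String guid) {
        return guidToPfns.getOrDefault(guid, new ArrayList<>());
    }

    public static void main(String[] args) {
        ReplicaCatalogue lrc = new ReplicaCatalogue();
        lrc.addReplica("guid-0001", "srm://cern.ch/castor/data/file1.root");
        lrc.setAttribute("guid-0001", "filetype", "ROOT file");
        System.out.println(lrc.lookup("guid-0001"));
    }
}
```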
  • 45. The Supported Configuration
    • All participating sites should run:
    • A “Local Replica Catalogue” (LRC)
      • Contains GUID <-> PFN mapping for all local files
    • A “Replica Location Index” (RLI) <-- independent of EDG deadlines
      • Allows files at other sites to be found
      • All LRCs are configured to publish to all remote RLIs
        • Scalability beyond O(10) sites??
        • Hierarchical and other configurations may come later…
    • A “Replica Metadata Catalogue” (RMC)
      • Not proposing a single, central RMC
      • Jobs should use local RMC
      • Short-term: handle synchronisation across RMCs
        • In principle possible today “on the POOL-side” (to be tested)
      • Long-term: middleware re-engineering?
  • 46. Component Overview (diagram: each of CERN, RAL, CNAF, and IN2P3 runs a Replica Location Index, a Local Replica Catalog, and a Storage Element)
  • 47. Where should these services be run?
    • At sites that can provide supported h/w & O/S configurations (next slide)
    • At sites with existing Oracle support team
    • We do not yet know whether we can make Oracle-based services easy enough to set up (surely?) and run (should be, for canned apps?) at sites where existing Oracle experience is not available
      • Will learn a lot from current roll-out
      • Pros: can benefit from scripts / doc / tools etc.
      • Other sites: simply re-extract catalog subset from nearest Tier1 in case of problems?
      • Need to understand use-cases and service level
  • 48. Requirements for Deployment
    • A farm node running Red Hat Enterprise Linux and Oracle9iAS
      • Runs Java middleware for LRC, RLI etc.
      • One per VO
    • A disk server running Red Hat Enterprise Linux and Oracle9i
      • Data volume for LCG 1 small (~10^5 – 10^6 entries, each < 1KB)
      • Query / lookup rate low (~1 every 3 seconds)
        • Projection to 2008: 100 – 1000 Hz; 10^9 entries
      • Shared between all VOs at a given site
    • Site responsible for acquiring and installing h/w and RHEL
      • $349 for ‘basic edition’: http://www.redhat.com/software/rhel/es/
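As a rough sanity check on these figures: 10^6 entries at under 1 KB each is below about 1 GB of catalogue data, and one lookup every 3 seconds is roughly 0.3 Hz, so the 2008 projection of 100 – 1000 Hz over 10^9 entries implies roughly a 300 – 3000× increase in query rate and a 1000× increase in entry count relative to LCG 1.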
  • 49. What if?
    • DB server dies
      • No access to catalog until new server configured & DB restored
      • ‘Hot standby’ or clustered solution offers protection against the most common cases
      • Regular dump of full catalog into alternate format, e.g. POOL XML?
    • Application server dies
      • Stateless, hence relatively simple move to a new host
        • Could share with another VO
      • Handled automatically with application server clusters
    • Data corrupted
      • Restore or switch to alternate catalog
    • Software problems
      • Hardest to predict and protect against
      • Could cause running jobs to fail and drain batch queues!
      • Very careful testing, including by experiments, before move to a new version of the middleware (weeks, including smallish production run?)
    • Need to foresee all possible problems, establish recovery plan and test!
    What happens during the period when the catalog is unavailable?
  • 50. Backup & Recovery, Monitoring
    • Backend DB included in standard backup scheme
      • Daily full, hourly incrementals + archive log – allows point in time recovery
      • Need additional logging plus agreement with experiments to understand ‘point in time’ to recover to – and testing!
    • Monitoring: both at box-level (FIO) and DB/AS/middleware
    • Need to ensure problems (inevitable, even if undesirable) are handled gracefully
    • Recovery tested regularly, by several members of the team
    • Need to understand expectations:
      • Catalog entries guaranteed for ever?
      • Granularity of recovery?
  • 51. Recommended Usage - Now
    • POOL jobs: recommend extracting catalog sub-set prior to job and post-cataloging new entries as separate step
    • Non-POOL jobs, e.g. EDG-RM client: at minimum, test RC and implement a simple retry, plus provide enough output in the job log for manual recovery if necessary (see the sketch after this list)
      • Perpetual retry is inappropriate if there is e.g. a configuration error
    • In all cases, need to foresee hiccoughs in the service, e.g. 1 hour, particularly during the ramp-up phase
    • Please provide us with examples of your usage so that we can ensure adequate coverage by the test suite!
    • A strict naming convention is essential for any non-trivial catalogue maintenance
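A hedged sketch of that bounded retry, in Java: a handful of attempts with back-off, enough logging for manual recovery from the job log, and no retry at all for errors that indicate misconfiguration. The catalogue call is a hypothetical placeholder for the LRC/RMC client API.

```java
// Sketch of the "simple retry" recommended above: bounded attempts with
// back-off, full logging for manual recovery, and an immediate abort on
// errors that look like configuration problems (perpetual retry would be
// inappropriate there). registerReplica() is a hypothetical stand-in for
// the real catalogue client call.
public class CatalogueRetry {

    static void registerWithRetry(String guid, String pfn, int maxAttempts)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                registerReplica(guid, pfn);       // hypothetical catalogue call
                return;
            } catch (IllegalArgumentException e) {
                throw e;                          // configuration error: do not retry
            } catch (Exception e) {
                // Log enough for manual recovery from the job log.
                System.err.printf("attempt %d/%d failed for %s -> %s: %s%n",
                        attempt, maxAttempts, guid, pfn, e);
                Thread.sleep(60_000L * attempt);  // back off; catalog may be down ~1h
            }
        }
        throw new Exception("giving up after " + maxAttempts + " attempts: " + guid);
    }

    static void registerReplica(String guid, String pfn) throws Exception {
        /* call the LRC/RMC client API here */
    }
}
```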
  • 52. Status
    • RLS/RLI/RMC services deployed at CERN for each experiment + DTEAM
      • RLSTEST service also available, but should not be used for production!
    • Distribution mechanism, including kits, scripts and documentation available and ‘well’ debugged
    • Only 1 outside site deployed so far (Taiwan) – others in the pipeline
      • FZK, RAL, FNAL, IN2P3, NIKHEF …
    • We need help to define list and priorities!
    • Actual installation rather fast (max a few hours)
    • Lead time can be long
      • Assigning resources etc. – a few weeks!
    • Plan is (still) to target first sites with Oracle experience to make scripts & doc as clear and smooth as possible
      • Then see if it makes sense to go further…
  • 53. Registration for Access to Oracle Kits
    • Well known method of account registration in dedicated group (OR)
    • Names will be added to mailing list to announce e.g. new releases of Oracle s/w, patch sets etc.
    • Foreseeing much more gentle roll-out than for previous packages
    • Initially just DBAs supporting canned apps
      • RLS backend, later potential conditions DB if appropriate
    • For simple, moderate-scale DB apps, consider use of central Sun cluster, already used by all LHC experiments
    • Distribution kits, scripts etc in afs
      • /afs/cern.ch/project/oracle/export/
    • Documentation also via Web
      • http://cern.ch/db/RLS/
  • 54. Links
    • http://cern.ch/wwwdb/grid-data-management.html
      • High-level overview of the various components; pointers to presentations on use-cases etc.
    • http://cern.ch/wwwdb/RLS/
      • Detailed installation & configuration instructions
    • http://pool.cern.ch/talksandpubl.html
      • File catalog use-cases, DB requirements, many other talks…
  • 55. Future Possibilities
    • Investigating resilience against h/w failure using Application Server & Database clusters
    • AS clusters also facilitate move of machines, addition of resources, optimal use of resources etc.
    • DB clusters (RAC) can be combined with stand-by databases and other techniques for even greater robustness
    • (Greatly?) simplified deployment, monitoring and recovery can be expected with Oracle 10g
  • 56. Summary
    • Addressing production-quality DB services for LCG 1
    • Clearly work in progress, but basic elements are in place at CERN; deployment outside is just starting
    • Based on experience and knowledge of Oracle products, offering distribution kits, documentation and other tools to those sites that are interested
    • Need more input on requirements and priorities of experiments regarding production plans
  • 57. Q & A — no, keep it plain: Questions & Answers