How eBay does Automatic Outage Planning

  1. Automatic Outage Planning at eBay. November 4, 2014. Kevin Isaacson, eBay Batch Tools and Infrastructure
  2. Environment and background
     • eBay Marketplaces Site environment
       – Jobs with direct impact on the site, billing, customer emails
       – ~1000 agents, 120,000 executions per day
       – 1090 site databases
     • Reasons for Outage Tool
       – Weekly maintenance window Thursday evenings, multiple databases taken down
       – Job dependencies not provided by PD
       – Job volume makes it impossible to manually plan around maintenance
       – Large maintenance can cause hundreds of job aborts
     Property of Automic Software. All rights reserved
  3. Goals
     • Solve two root problems
       – How do we determine which jobs connect to which databases?
       – How do we have those jobs avoid outages without manual intervention?
     • Design system using Automic internal features wherever possible
  4. Outage Tool Architecture
     • Two main components
       – Dependency scanner: finds external dependencies (sqlnet, ftp, http connections)
       – Outage prefix script: uses dependency data to prevent job aborts
     • All data stored in Automic VARA objects and archive keys
       – Can be viewed within GUI without additional tools
       – No additional tables required
  5. Data Population
     [Diagram. Elements shown: VARA.NEXT_OUTAGE; three agents, each running JOBS.DEP_SCAN; VARA.DEP.WORKFLOW.JOB; a JOB with Archive Key "OB=DELAY" and Post Process tab ":INC POST_OUTAGE_RESTART"]
  6. Dependency Scanner: Overview
     • Perl script
     • Scheduled every 303 minutes on agent group with all agents
     • Uses OS commands to gather open connections and tie them to job objects
       – ps command to gather list of running Automic jobs
         • Convert name of temp file to base 10 job runid
       – pstree command gets child processes
       – pfiles (Solaris) or lsof (Linux) to get list of external connections
       – Dependency data written to stdout. Example:
         DEP=187270226:caty2phx8.vip.ebay.com:1521
         DEP=187332422:phxuc4app02.phx.ebay.com:2221
         DEP=187332422:cal.vip.phx.ebay.com:1118
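The DEP= lines in the example above are simple enough to parse mechanically. The following is a minimal Python sketch of that step, not the scanner itself (which the slide says is a Perl script). The function name `parse_dep_lines` and the choice to group connections by run ID are assumptions made for illustration.

```python
from collections import defaultdict

def parse_dep_lines(report):
    """Group DEP=<runid>:<host>:<port> lines by run ID.

    Hypothetical helper: the slide only shows the line format, not how
    the real tooling reads it back.
    """
    deps = defaultdict(set)
    for line in report.splitlines():
        line = line.strip()
        if not line.startswith("DEP="):
            continue  # ignore anything that is not a dependency record
        parts = line[len("DEP="):].split(":")
        if len(parts) != 3:
            continue  # skip malformed records rather than guessing
        runid, host, port = parts
        deps[int(runid)].add((host, int(port)))
    return dict(deps)

SAMPLE = """\
DEP=187270226:caty2phx8.vip.ebay.com:1521
DEP=187332422:phxuc4app02.phx.ebay.com:2221
DEP=187332422:cal.vip.phx.ebay.com:1118
"""

if __name__ == "__main__":
    for runid, conns in sorted(parse_dep_lines(SAMPLE).items()):
        print(runid, sorted(conns))
```

Using a set per run ID means a connection seen on several scans is recorded once, which fits a scanner that runs repeatedly against long-running jobs.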
  7. • Harvester job parses output from RT table, looks up job and workflow names, creates Automation Engine script files
     • Executes CallAPI on the script files to update VARA objects
       :PUT_VAR VARA.DEP.CIMS_RADAR_ASSERTION_SUB.V3_CIMS_RADAR_ASSERTION_BATCH_3, "tns.vip.ebay.com", "1521", "Mon Jan 23 16:52:19 2012"
       :PUT_VAR VARA.DEP.RADAR_ARCHIVE_EVENTS.V3_RADAR_ARCHIVE_EVENTS_30, "storageservice.vip.slc.ebay.com", "80", "Mon Jan 23 16:52:20 2012"
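As a rough illustration of the harvester's last step, the sketch below formats one dependency record as a :PUT_VAR line in the same shape as the two examples above. It is a Python sketch under assumptions: the function name `put_var_line` and its parameters are hypothetical, the lookup of job and workflow names is taken as already done, and the CallAPI invocation itself is not shown.

```python
from datetime import datetime

def put_var_line(workflow, job, host, port, seen_at):
    """Format one dependency as a :PUT_VAR script line.

    Mirrors the layout of the slide's examples: VARA name, host as key,
    then port and a timestamp as values. Names here are illustrative.
    """
    vara = f"VARA.DEP.{workflow}.{job}"
    # Same layout as the slide's timestamps, e.g. "Mon Jan 23 16:52:19 2012".
    stamp = seen_at.strftime("%a %b %d %H:%M:%S %Y")
    return f':PUT_VAR {vara}, "{host}", "{port}", "{stamp}"'

if __name__ == "__main__":
    print(put_var_line(
        "RADAR_ARCHIVE_EVENTS",
        "V3_RADAR_ARCHIVE_EVENTS_30",
        "storageservice.vip.slc.ebay.com",
        80,
        datetime(2012, 1, 23, 16, 52, 20),
    ))
```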
  8. Dependency Variable Example: VARA.DEP.WORKFLOW.JOB
  9. Outage Prefix Script
     • Called by HEADER.UNIX.USER.PRE in client 0
     • Runs before every job
     • Decision flow:
       – Is a current or future outage defined? N: exit. Y: continue.
       – Per my ERT, will I be affected? N: exit. Y: continue.
       – Are any resources in the outage also in my VARA? N: exit. Y: perform defined behavior.
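The three checks in the prefix script's decision flow can be sketched as a single predicate. This is a Python illustration only; the real prefix script runs inside Automic. The `Outage` class, the function name `outage_applies`, and the reading of "affected" as "the window from now to now plus ERT overlaps the outage" are all assumptions, since the slide does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Outage:
    """Hypothetical stand-in for the contents of VARA.NEXT_OUTAGE."""
    start: datetime
    end: datetime
    resources: frozenset  # host names taken down during the window

def outage_applies(now, ert, outage, job_deps):
    """Return True only if all three checks from the flow say 'Y'."""
    # 1. Is a current or future outage defined?
    if outage is None or outage.end <= now:
        return False
    # 2. Per the job's ERT, will it be affected?
    #    Assumption: affected means the expected run overlaps the window.
    if now + ert <= outage.start:
        return False
    # 3. Are any resources in the outage also in the job's dependency VARA?
    return bool(outage.resources & set(job_deps))

if __name__ == "__main__":
    window = Outage(
        start=datetime(2014, 11, 6, 20, 0),
        end=datetime(2014, 11, 6, 23, 0),
        resources=frozenset({"caty2phx8.vip.ebay.com"}),
    )
    deps = {"caty2phx8.vip.ebay.com"}
    print(outage_applies(datetime(2014, 11, 6, 19, 30), timedelta(hours=1), window, deps))
```

Ordering the checks from cheapest to most specific keeps the common case (no outage defined) to a single comparison, which matters for a script that runs before every job.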
  10. VARA.NEXT_OUTAGE
  11. Supported Outage Behaviors
     • Put the following in either of the Archive Key fields (on General tab) for the JOB.
     • OB=[behavior value];
       – DELAY - Job will delay until the end of the outage window (it is not necessary to "delete waiting jobs" because the queued jobs will all evaluate individually whether they should start or skip)
       – SKIP - The job will skip any runs that would extend into the outage window. We decided that DELAY is almost always better.
       – RESTART (not implemented yet) - if it aborts, change exit code to 0 and message to "MAINT_RESTARTED" and restart it after the window.
         • Must include the command ":INCLUDE INC.OUTAGE.POSTPROCESS" on the post-processing tab of the job
       – IGNORE - if it aborts, change exit code to 0 and message to "MAINT_IGNORED"
         • Must include the command ":INCLUDE INC.OUTAGE.POSTPROCESS" on the post-processing tab of the job
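To make the OB=[behavior value]; convention concrete, here is a small Python sketch that reads the behavior out of the two Archive Key fields. The function name `outage_behavior`, the case-insensitive matching, the optional trailing semicolon, and the choice to raise on unknown values are assumptions; the slides do not describe how the real script parses the field.

```python
import re

# The four behavior values listed on the slide.
BEHAVIORS = {"DELAY", "SKIP", "RESTART", "IGNORE"}

# Assumption: the value is a single word and the trailing ";" may be missing.
OB_PATTERN = re.compile(r"OB=([A-Za-z_]+);?")

def outage_behavior(*archive_keys):
    """Return the behavior named in the first archive key that has OB=...

    Returns None when no key carries an OB= entry, and raises ValueError
    for a value that is not one of the supported behaviors.
    """
    for key in archive_keys:
        match = OB_PATTERN.search(key or "")
        if match is None:
            continue
        value = match.group(1).upper()
        if value not in BEHAVIORS:
            raise ValueError(f"unsupported outage behavior: {value}")
        return value
    return None

if __name__ == "__main__":
    print(outage_behavior("OB=DELAY;", ""))
    print(outage_behavior("", "OB=skip;"))
    print(outage_behavior("unrelated", None))
```

Rejecting unknown values up front is one way a check like this could catch a mistyped archive key before the job reaches an outage window.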
  12. Weaknesses & Improvements
     • Tough to catch connections for short-running jobs
     • Takes days or weeks to build full dependency data for a new job
     • Overhead for prefix script on every job
     • No input validation for manually edited items (OB in archive key, VARA.NEXT_OUTAGE settings)
     • Dependencies apply only to jobs; it would be nice to apply outage behavior to a whole workflow