White Paper: Don't Let a Bad Trigger Ruin Your Checkin


Mark Harrison, Pixar Animation Studios

Our Trigger Goals.

Perforce checkin triggers are very useful to us. Users can check in their files in any way they see fit, and we can provide our services using post-commit triggers. This worked well, and had several benefits:

• We could guarantee that the triggers would run on all file checkins. There was no path by which the code could be avoided.
• We were decoupled from the front-end application checking in the file. We did not need to be linked in with, or share a release schedule with, that code base.
• Trigger code could be replayed on a checkin in case of error.

Lesson learned: triggers are good!

First Try: Pure Triggers.

But as we went along, we hit a couple of problems:

• As the number of repositories grew (much faster than we anticipated!), it became more work to make sure the triggers were in sync. Adding a new trigger likewise became a much larger task, since we needed to do so on several repositories.
• Triggers can hang. Sometimes NFS mounts can go bad, or a bad database state (e.g. an open transaction holding a write lock) can block a trigger.

The most ironic problem: triggers worked so well for us, everybody wanted one! There were numerous projects that could benefit from being informed when movie assets were created or modified; many of these would update some database tables or cache some of the data in the assets.
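A typical trigger of this kind is registered in the Perforce triggers table and fires after a changelist has been committed. The sketch below is purely illustrative -- the script path, database, and schema are hypothetical, not our production code -- but it shows the general shape of the per-project triggers people were asking for:

    #!/usr/bin/env python
    # Illustrative post-commit trigger: record the files in a submitted
    # changelist into a database table.  Hypothetical triggers-table entry:
    #
    #   asset-index change-commit //depot/... "python /triggers/index_assets.py %changelist%"
    #
    # Connection settings (P4PORT, P4USER) are assumed to come from the
    # environment the Perforce server provides to trigger scripts.

    import sqlite3
    import subprocess
    import sys

    def changelist_files(change):
        """Return the depot paths in a submitted changelist via 'p4 describe -s'."""
        out = subprocess.check_output(["p4", "describe", "-s", change])
        paths = []
        for line in out.decode("utf-8", "replace").splitlines():
            if line.startswith("... //"):           # e.g. "... //depot/shot/a.mov#3 edit"
                paths.append(line.split()[1].split("#")[0])    # drop the #rev suffix
        return paths

    def main():
        change = sys.argv[1]
        db = sqlite3.connect("/var/tmp/asset_index.db")     # hypothetical location
        db.execute("CREATE TABLE IF NOT EXISTS assets (change_num TEXT, depot_path TEXT)")
        for path in changelist_files(change):
            db.execute("INSERT INTO assets VALUES (?, ?)", (change, path))
        db.commit()
        return 0        # a non-zero exit status is reported back to the submitting user

    if __name__ == "__main__":
        sys.exit(main())

Every group that wanted to react to checkins ended up owning a script along these lines, each with its own entry in the triggers table on each server.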
This demand amplified the two problems we had with our own triggers. Calling code outside of our control meant that we couldn't even fix things ourselves when checkin errors happened, and each of our trigger configurations showed signs of fulfilling the old game puzzle of "you are in a twisty maze of little passages, all different":

• What depots were supposed to have which triggers?
• Having to hand-edit numerous trigger specs whenever somebody changed their software.
• As more triggers appeared, checkins got slower. Each trigger is run sequentially, so we couldn't even take advantage of multiple boxes or processors to speed things up. Some of the triggers would scrape metadata out of each file checked in (image formatting, color profiles, etc.), so we could conceivably end up having to read each file multiple times before the checkin would return to the user.
• Having to "slightly" modify trigger parameters ("oh, for that depot can you set the option --bargle=4, but if it's on a box without NFS patches can you instead use --bargle=4 and --nopts=2?").
• As more triggers started appearing, the number of checkin problems due to the triggers started to rise. We certainly didn't want that to happen, since one of Perforce's selling points is that it's really stable.

Lesson learned: lots of triggers are bad!

Second Try: Using Triggers to Enqueue Work.

We looked at the problem again, focusing on these questions:

• How can we allow multiple groups to benefit from checkin-driven triggers?
• How can we avoid the slowness involved with running multiple triggers?
• How can we eliminate the administrative overhead of managing triggers?
• How can we eliminate the runtime errors and the troubleshooting that triggers required?

We came up with these two rules:

• Every set of post-submit triggers must be the same across all depots.
• The post-submit triggers must execute as quickly as possible.

Additionally, we wanted to ensure:

• We would be able to accommodate any groups that needed special backend execution.
• We would have some means of telling front-end systems that their trigger was finished or that it failed. Preferably this would be a non-blocking mechanism, so that the applications could, for example, keep their GUIs alive. For non-interactive applications (e.g. thumbnail generation) we would log the errors and provide an error notification.
• We could execute these tasks in parallel on different boxes for speed.
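As an illustration of the second rule, here is a minimal sketch of an enqueue-only post-submit trigger. The queue table, paths, and names are hypothetical rather than our production code; the point is simply that the trigger does nothing but record the checkin and return:

    #!/usr/bin/env python
    # Illustrative enqueue-only post-commit trigger.  It records which server a
    # changelist was submitted to and exits immediately; all real work happens
    # later in backend queue processors.  Hypothetical triggers-table entry,
    # identical on every server:
    #
    #   enqueue change-commit //... "python /triggers/enqueue.py %serverport% %changelist%"

    import sqlite3
    import sys

    QUEUE_DB = "/var/tmp/checkin_queue.db"      # hypothetical; ours is a central database

    def enqueue(server, change):
        db = sqlite3.connect(QUEUE_DB, timeout=2.0)   # short timeout: never hang a checkin
        db.execute("CREATE TABLE IF NOT EXISTS checkins "
                   "(server TEXT, change_num TEXT, state TEXT DEFAULT 'pending')")
        db.execute("INSERT INTO checkins (server, change_num) VALUES (?, ?)",
                   (server, change))
        db.commit()
        db.close()

    if __name__ == "__main__":
        enqueue(sys.argv[1], sys.argv[2])

Because the trigger body is only a quick insert, checkin latency stays flat no matter how many downstream consumers there are; adding a consumer means adding a queue reader, not another trigger.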
Our solution was to execute exactly two post-submit triggers:

• The LINKATRON (presented at the 2009 conference), which would ensure that the trigger-like programs would have access to the files checked in via NFS, and they wouldn't have to check out the files to process them. This was especially important for media files... think of a several-gig video clip where some information needed to be extracted from a header record in the file.
• Our database backend, which would handle the enqueuing of the files and changelists to other backend applications.

We would ensure the backend processors would be first-class members of our Perforce infrastructure by writing all of our own processors as plugins. This also gave us the advantage of being able to process certain items (e.g. thumbnail generation) in parallel.

Our Implementation and Usage.

We implemented this system as a workflow queue manager. There are several off-the-shelf queueing systems that could be used, but due to our particular requirements and development environment we ended up implementing our own.

Each application has its own queue, and can register to receive notifications at either:

• The file level. This allowed an application such as our thumbnail generator to start processing files quickly, without having to perform the extra processing necessary to read a changelist, break it apart, and start processing each item. It also has the advantage that each of the files can be treated as an atomic work unit -- if a thumbnail fails for one file, there's no reason all the other thumbnails shouldn't be generated.
• The changelist level. For some other applications, it was better to receive exactly one notification per checkin. For these notifications, we included the depot name and the changelist number; if the application wanted to see the contents of the changelist, it could examine that on its own.

This has several advantages, both for the end user and for the groups providing the triggers:

• A single broken queue processor does not break a checkin. Of course, if your workflow depends on work being done by that processor you will be blocked, but many tasks (e.g. thumbnail generation or keyword mining) can be done after the fact.
• It is easy to identify a queue processor that is broken, and notify the responsible party. If a queue is filling up and nothing is being processed, we issue a warning to the queue owner.
• It is easy to see what work needs to be caught up when breakage is repaired. By the nature of the queue system, all uncompleted work is still in the queue, ready to be processed when the processor is restarted.
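To make that concrete, here is a minimal sketch of a changelist-level queue processor, written against the hypothetical queue table from the trigger sketch above; registration, per-application queues, and the plugin mechanism are all omitted:

    #!/usr/bin/env python
    # Illustrative changelist-level queue processor.  It repeatedly takes a
    # pending checkin, does its work, and marks the row done or in error.  If
    # the processor dies, unfinished rows stay 'pending' and are simply picked
    # up again after a restart.

    import sqlite3
    import time

    QUEUE_DB = "/var/tmp/checkin_queue.db"      # hypothetical, matches the trigger sketch

    def process(server, change):
        # Application-specific work goes here: mining keywords, updating an
        # asset database, generating thumbnails, and so on.
        print("processing change %s submitted to %s" % (change, server))

    def run():
        db = sqlite3.connect(QUEUE_DB)
        while True:
            row = db.execute("SELECT rowid, server, change_num FROM checkins "
                             "WHERE state = 'pending' ORDER BY rowid LIMIT 1").fetchone()
            if row is None:
                time.sleep(5)                   # nothing queued; check again shortly
                continue
            rowid, server, change = row
            try:
                process(server, change)
                state = "done"
            except Exception:
                # Leave a marker for the queue owner; one bad item must not
                # stop the queue or, more importantly, anyone's checkin.
                state = "error"
            db.execute("UPDATE checkins SET state = ? WHERE rowid = ?", (state, rowid))
            db.commit()

    if __name__ == "__main__":
        run()

A file-level queue works the same way, except that it holds one row per file rather than one per changelist, which is what lets a task such as thumbnail generation treat each file as an atomic work unit.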
Synchronous Operation

In order to handle the requirement that the queue processors operate in a synchronous manner, we use our internally developed Templar Broadcasting System. This messaging system uses multicast UDP. Measurements on our network showed that there was minimal (microsecond) latency, and we could handle a sustained rate of 30,000 or more messages/second reliably. Of course, delivery is not guaranteed, so applications need to provide an alternate method for verifying that their work has been completed. A typical application might query the database for a particular file or changelist.

However, since in our environment multicast is "mostly reliable", we can set a relatively long timeout period before having to fall back to the polling mechanism. Most applications are therefore able to continue almost immediately when the notification is sent.
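On the client side, the notify-then-verify pattern might look roughly like the sketch below. The multicast group, port, and message format are hypothetical stand-ins for the Templar Broadcasting System, and the fallback is the database query described above:

    #!/usr/bin/env python
    # Illustrative client-side wait: listen for a completion broadcast, and if
    # it never arrives (multicast delivery is not guaranteed), fall back to
    # polling the queue database.  Group, port, and message format are made up.

    import socket
    import sqlite3
    import struct
    import time

    MCAST_GROUP, MCAST_PORT = "239.1.2.3", 5007         # hypothetical
    QUEUE_DB = "/var/tmp/checkin_queue.db"               # hypothetical

    def open_listener():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", MCAST_PORT))
        mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        return sock

    def done_in_db(change):
        db = sqlite3.connect(QUEUE_DB)
        row = db.execute("SELECT 1 FROM checkins WHERE change_num = ? AND state = 'done'",
                         (change,)).fetchone()
        db.close()
        return row is not None

    def wait_for_change(change, timeout=30.0):
        """Return once work for 'change' is reported done, by broadcast or by polling."""
        sock = open_listener()
        sock.settimeout(timeout)
        deadline = time.time() + timeout
        try:
            while time.time() < deadline:
                try:
                    msg = sock.recv(4096).decode("utf-8", "replace")
                except socket.timeout:
                    break                       # no broadcast arrived in time
                if msg.strip() == "done %s" % change:
                    return                      # the fast, usual path
        finally:
            sock.close()
        # Broadcast lost or timed out: the database remains the authority.
        while not done_in_db(change):
            time.sleep(5)

    if __name__ == "__main__":
        wait_for_change("123456")               # hypothetical changelist number

Because multicast is "mostly reliable" on our network, the polling path is rarely exercised, and interactive applications can keep their GUIs alive while they wait.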

Summary

We followed these steps in our implementation process and are happy with the results. The final system allows several groups to write checkin-time code, while protecting checkins from any breakage in those bits of code.

• Triggers
• Lots of triggers
• A small number of triggers, feeding work queues

Lesson learned: triggers + work queues are great!