Scalable Hierarchical Coarse Grained Timers


ACM Operating Systems Review, January 2000, Volume 34, Number 1: Pages 11-20


Scalable Hierarchical Coarse-grained Timers

Rohit Dube
High Speed Networks Research
Room 4C-508, Bell Labs, Lucent Technologies
101 Crawfords Corner Road, Holmdel, NJ 07733
Email: rohit

Abstract: Several network servers and routing and signalling protocols need a large number of events to be scheduled off timers. Some of these applications can withstand a bounded level of inaccuracy in when the timer is scheduled. In this paper we describe a novel mechanism called “scalable hierarchical coarse grained timers” which can handle the scheduling of a large number of events while incurring a minimum of cpu and memory overhead. The techniques presented here were implemented on a commercial IP routing system and are used by the routing stack to damp flapping BGP routes. The paper reflects our experiences in carrying out this implementation and the subsequent performance analysis.

I. INTRODUCTION

With the explosive growth of network use, a plethora of new services and protocols have been developed. Virtually all of these applications assume a timer facility, both in the definition of the service or protocol and in its implementation. Routing and signalling protocols, especially those that deal with a large number of routes and/or states, make intense use of the timer abstraction. For example, a router on the border of a routing domain peering with routers in other routing domains using the Border Gateway Protocol (BGP [15]) may experience a large number of route fluctuations due to network instability and may need to hold down these routes till they stabilize [19]. In most router systems, this requires extensive use of the timer sub-system.

Timer implementations typically have a practical limit on how far they scale in terms of sheer numbers. With any reasonable timer implementation, this limit is dictated by either the cpu or memory usage of the timer module.
Given the number of events that can be handled by such a module, there exist many network applications which require an order of magnitude or more events to be scheduled and descheduled off timers than the module can handle. If these events use a timer each, it follows that the system resources cannot meet the processing requirements of the application.

Often these very applications have a high tolerance to inaccuracies in the scheduling of events and spend comparatively little time processing each event. This allows for aggregating events which are to be scheduled within a short duration of each other. Such a set of events can be scheduled using a single timer, reducing the total number of timers in use as well as the overhead of the timer sub-system by processing multiple events per timer. In the following sections we describe a technique which combines this allowed coarseness with hierarchical stacking of timer modules to produce a highly scalable timer sub-system.

A timer module using the concepts described in this paper was implemented on a commercial IP router. The BGP route-flap damping code was then modified to use this new timer module. As we demonstrate in the subsequent sections, the implementation is modular, highly scalable and lends itself to tuning as per an application's requirements.

We start with a description of the main concepts of this technique in Section II. This is followed by implementation details in Section III, a real-life application to BGP route-flap damping in Section IV and an analysis of the design in Section V. Related work is discussed in Section VI just before we conclude in Section VII.

II. HIERARCHICAL TIMERS AND COARSENESS

Most complete systems like operating systems or routers have a very high resolution clock (potentially supplied by the hardware) around which a timer module is built.
Such a timer module, which is typically implemented in software, has a resolution in the microsecond to millisecond range. The clock and the timer module represent the bottom two layers of the hierarchy depicted in figure 1. Well implemented timer modules scale into the range of thousands in terms of outstanding timer events and meet the requirements of most applications which use them directly.

However, the Level 0 timer module is not adequate for certain applications which schedule and deschedule upwards of a hundred thousand events off timers. This limitation arises from the total amount of memory required by Level 0 timers, one of which is needed for each event scheduled by the application. More seriously, the
cpu overhead of firing a Level 0 timer (which may be a system call) per event is prohibitive and can severely degrade the performance of other sub-systems supported by the Level 0 timers. Such applications, one of which is discussed later in Section IV, can be supported with a Level 1 timer module which makes use of the primitives provided by the timer module at Level 0. In theory, one can stack a third layer of timers on top of an instance of Level 1 timers and so on.

Fig. 1. Hierarchical Timers (layers, bottom to top: Low Level Clock, Level 0 Timers, Level 1 Timers 1, 2, ..., N)

Two properties need to be satisfied by a timer module layered on top of another. First, it must have a coarser resolution compared to the resolution of the lower level timer module it sits on. Second, it must have the ability to bundle events which are scheduled at the same instant, using a single lower level timer.

Fig. 2. Scheduling with coarse resolution (events 1 to 4 on a time line marked with the Level 0 resolution and the Level 1 resolution)

Consider the time line in figure 2. The shorter notches in the diagram stand for the resolution of Level 0 timers whereas the longer notches depict the resolution for Level 1. The small buckets at the bottom represent events to be scheduled. If these events are scheduled using Level 1 timers, they are forced to align along the resolution boundary of the Level 1 timers. In this process, some of them, like events 3 and 4, are scheduled at the same time and can be managed by the same Level 1 timer. This bundling due to the coarse resolution of the Level 1 timers allows the Level 1 module to support a large number of events using a relatively small number of Level 0 timers.

Note that a Level 1 timer is a Level 0 timer with one or more events queued off it. For clarity, we will refrain from using the term ‘Level 1 timer’. Instead, we will refer to these timers as ‘Level 0 timers with multiple queued events’.
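The alignment step described above can be illustrated with a minimal C sketch. The function name `l1_align` and the choice to round up (so that an event never fires before its requested time) are assumptions made for illustration; the text does not say in which direction events are aligned.

```c
#include <assert.h>

/*
 * Align a requested firing time (in seconds) to the next Level 1
 * resolution boundary. Rounding up is an assumption of this sketch;
 * events that land on the same boundary can share one Level 0 timer.
 */
static unsigned long l1_align(unsigned long poptime_s,
                              unsigned long resolution_s)
{
    assert(resolution_s > 0);
    return ((poptime_s + resolution_s - 1) / resolution_s) * resolution_s;
}
```

With a resolution of 8 seconds, requests for 13 and 15 seconds both align to 16 seconds and could therefore be bundled, much like events 3 and 4 in figure 2.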
Since the Level 1 module makes use of Level 0 timers, the features provided by and the performance of the Level 0 module directly affect the Level 1 timers. For example, if the Level 0 timers provide a low-priority or bulk facility, the Level 1 module can be made to exclusively use this facility. This helps the overall system stability as higher priority events can be scheduled by the Level 0 module over the Level 1 events which are resistant to inaccuracies and slip.

III. IMPLEMENTATION DETAILS

An implementation of a Level 1 module was carried out on a commercial IP routing system [12]. The implementation was done using C [9] and is platform independent, having been tested on Linux [14] and Solaris [17] besides the IP router. In this section we describe the salient points of this implementation.

A. Level 1 Timer Interface

Like any other timer implementation, the Level 1 timer module has the standard Application Programming Interface (API):

    timer_init (resolution, maxevents, handlers, callback);
    timer_schedule (instance, event, poptime);
    timer_deschedule (instance, event);

The difference over typical timer APIs comes from supporting multiple instances of this module and from the use of handlers. The handlers are needed because the module does not understand the internal structure of the events to be scheduled and descheduled, but needs access to some of the fields of the event's data structures for efficient queuing and dequeuing.

The ‘resolution’ parameter in the timer_init() call is a measure of coarseness in seconds. The module ensures that any Level 0 timers used will be spaced at ‘resolution’ seconds. The lowest resolution supported by our implementation is 1 second. The second parameter ‘maxevents’ is the maximum number of events that can be queued off a single Level 0 timer.
This is important so that the user has a way of preventing the system from going into an extended loop in the case where a large number of events are queued to be fired at the same time. The ‘callback’ is a function or a method provided by the user application to process the event when the timer fires. The ‘callback’ is unusual in the sense that it can process multiple events with the aid of the ‘handlers’.

The timer_init() call returns an ‘instance’ which is a reference to the instantiation of the Level 1 module being referred to. The ‘instance’ is an input to the schedule and deschedule calls. The ‘event’ parameter is a pointer or reference to the event being scheduled or descheduled. The
‘poptime’ parameter is the time increment from when the call is made to when the timer is expected to be fired (i.e. the scheduled time of the event).

B. Scheduling and Descheduling

As the astute reader may have realized, queuing multiple events off the same Level 0 timer requires a scheme to quickly seek to an existing timer. When a timer_schedule() call is made to schedule an event, the module needs to determine if there is an existing timer with space to accommodate a new event. Similarly, when an application decides to deschedule an event, the module needs to seek to the timer queue holding the event and dequeue it.

This seek to an existing timer is a search problem which can be solved efficiently with a balanced-tree or a hash-table lookup. But for either of these search schemes to be used, the module needs a key to uniquely identify a timer. We use the absolute time since the module is initialized to generate this key. A timer is simply identified by the time difference between when the timer is to be fired and the initialization time.

We use a red-black tree (rb-tree) [2] as the search mechanism. Figure 3 shows a block level view of the software architecture: the Level 1 module sits on top of and is implemented using the primitives from an rb-tree module and the Level 0 timer module.

Fig. 3. API Layering (the Level 1 module on top of the RB-tree module and the Level 0 module)

A search mechanism such as an rb-tree makes the implementation of the timer_schedule() call efficient, but not necessarily that of the timer_deschedule() call. This is because multiple events may be scheduled off the same timer and searching for the event to be descheduled involves yet another search. To solve this problem, the module requires the user application to provide ‘handlers’ in the timer_init() call, using which the module maintains the events queued off a timer in a doubly linked list. This allows for efficient queueing onto and dequeuing from the timer queue.
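The role of the ‘handlers’ can be illustrated with a short C sketch of an intrusive doubly linked queue. All names here (`l1_handlers`, `l1_queue`, `route_event` and the accessor functions) are hypothetical and chosen for illustration; the paper does not give the actual handler signatures. The point is only that, given accessors for link fields stored inside the user's event, an event can be unlinked in constant time without searching the queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler table: accessors for link fields that live
 * inside the user's event structure, which the module cannot see. */
struct l1_handlers {
    void **(*next_of)(void *event);
    void **(*prev_of)(void *event);
};

/* Queue of events hanging off one Level 0 timer. */
struct l1_queue {
    void  *head;
    size_t count;   /* compared against 'maxevents' by the caller */
};

/* Push an event at the head of the queue: constant time. */
static void l1_enqueue(struct l1_queue *q, const struct l1_handlers *h,
                       void *ev)
{
    *h->prev_of(ev) = NULL;
    *h->next_of(ev) = q->head;
    if (q->head != NULL)
        *h->prev_of(q->head) = ev;
    q->head = ev;
    q->count++;
}

/* Unlink an event given only its pointer: constant time, no search. */
static void l1_dequeue(struct l1_queue *q, const struct l1_handlers *h,
                       void *ev)
{
    void *prev = *h->prev_of(ev);
    void *next = *h->next_of(ev);

    assert(q->count > 0);
    if (prev != NULL)
        *h->next_of(prev) = next;
    else
        q->head = next;
    if (next != NULL)
        *h->prev_of(next) = prev;
    *h->prev_of(ev) = NULL;
    *h->next_of(ev) = NULL;
    q->count--;
}

/* Example user event (hypothetical) and its handlers. */
struct route_event {
    int   prefix_id;
    void *next;
    void *prev;
};

static void **route_next(void *e) { return &((struct route_event *)e)->next; }
static void **route_prev(void *e) { return &((struct route_event *)e)->prev; }
```

A caller would build a handler table such as `{ route_next, route_prev }` once and pass it to every queue operation. Storing the links in the event itself avoids a separate allocation per queued event, which matters at the event counts discussed in this paper.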
Finally, since the module allows the user to place a bound on the maximum number of events to be queued off a single timer, multiple timers are needed to accommodate events in excess of ‘maxevents’. The module maintains multiple Level 0 timers in a doubly linked list attached to the rb-tree node. As far as the Level 1 module is concerned, these timers fire at the same time. Of course, the Level 0 module which actually fires the timers will space them depending on the load that it sees. Since the Level 1 module is coarse-grained to begin with, the loss of accuracy is not a cause for concern.

Fig. 4. Run-time View (Level 1 timer nodes, the Level 0 timers in their timer queues, and the scheduled events)

Figure 4 shows a run-time view of a module instance. The rightmost Level 1 timer node is expanded out. This node contains two Level 0 timers, the first of which is filled to capacity while the second holds the overflow.

A few additional notes need to be made at this point. First, the Level 0 timer nodes belonging to the same Level 1 node are arranged in a doubly linked list. This is to facilitate easy cancellation and removal of a Level 0 timer in case all of its events are descheduled. Second, when a new event is scheduled, the linked list of timers is traversed as the module looks for space in the first available timer. Thus a counter or a flag needs to be maintained in the Level 0 node, indicating the number of events queued off the node. If no timer has fewer than ‘maxevents’ events in its queue, a new Level 0 timer is obtained and a new Level 0 node added to the linked list. Third, it is possible at run-time that more than one of the queues hanging off Level 0 nodes, which are in turn attached to the same Level 1 node, are less than full. These queues can be compacted if desired.
We left out this optimization from our implementation as the overhead of compacting and the subsequent cost of maintaining the optimization in large systems was deemed greater than the potential savings of Level 0 timers.

IV. APPLICATION TO BGP ROUTE-FLAP DAMPING

The Border Gateway Protocol (BGP [7], [8], [15]) is the inter-domain routing protocol of choice in the Internet today. Most Internet service providers make a distinction between their border routers (which import routes from their customers and other service providers) and core routers (which are higher capacity and see half a million to a million prefixes). The Internet sees a great amount of instability which leads to constant flapping of routes [10], [11]. If
these flaps are allowed to propagate all the way to the core routers, their performance is seriously impacted. Service providers therefore turn on flap-damping [19] on their border routers (see figure 5) for routes learnt from routers external to their network. Flapping routes are held down till they stabilize before they are passed on to internal routers which are part of the core network.

Fig. 5. BGP border router damps route flaps (external routes arrive over external BGP sessions, are damped, and are passed on as internal routes over internal BGP sessions)

During periods of serious disruption in the networks of other providers, several thousand routes can flap up and down. The border router needs to be able to damp all of these. Naively implemented damping uses a Level 0 timer for each flapping route. The timer is set to fire after the period of the hold-down. In case the route flaps again, the timer needs to be canceled and scheduled again for a later time (or rescheduled, depending on the Level 0 primitive supported). During large-scale network outages, the usage of Level 0 timers goes up as the number of flapping routes increases, severely slowing down the timer sub-system and affecting other protocols which may drop into failure modes.

Route-flap damping does not require a great deal of accuracy, and routes held down for a few additional seconds do not impact the network adversely by much. This presents an ideal opportunity for using a Level 1 timer module. Indeed, the implementation described in the preceding sections was grafted onto a commercial IP router [12]. The direct use of Level 0 timers was taken out and replaced by a Level 1 module within a matter of days. Most of the work involved came from coding the ‘handlers’, modifying the ‘callback’ function and testing the system for regression. Even so, the overall task of migrating to the Level 1 module was simple.

V. ANALYSIS

An application using Level 1 timers in place of Level 0 timers decreases both the memory and cpu usage of the system. The improvement in memory usage comes simply because fewer Level 0 timers are needed by the system. Similarly, the cpu usage is improved because the cost of scheduling a timer and subsequently firing or descheduling it is amortized over multiple events which are processed from the same callback.

This is especially true for systems which implement the Level 0 module in kernel space, typically with pre-allocated memory. If the Level 1 module is implemented in user space on these systems, the failure characteristics of the system are improved in addition to memory scalability: the kernel memory usage is decreased as fewer Level 0 timers are used, as is the probability of an event being denied a timer. Further, the cpu utilization is minimized as the expensive kernel-user boundary is crossed once per Level 0 timer scheduled. The total number of Level 0 timers used is a small fraction of the total number of events scheduled, implying that there are fewer kernel-user crossings in all. The following experiments corroborate this claim.

TABLE I: 100 EVENTS (number of Level 0 timers used)

Resolution      Maxevents
(in seconds)    10      50      250
2               98      98      98
4               92      92      92
8               85      85      85
16              77      77      77
32              64      64      64

A. Experimental Results

Tables I to V show data obtained from sample runs using the previously described implementation of the Level 1 module. The metric that we record is the number of Level 0 timers actually used by the module for varying ‘resolution’, ‘maxevents’ and the total number of events scheduled (see Section III for definitions).

We recorded runs with resolutions of 2, 4, 8, 16 and 32 seconds and maxevents of 10, 50 and 250 under loads of 100, 1000, 10000, 100000 and 1000000 events. In all cases, the events were uniformly distributed to be scheduled over a period of 3600 seconds (one hour).
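As a rough sanity check on what such runs can be expected to show, the C sketch below estimates the number of Level 0 timers under an idealised assumption that is not made in the paper: events are spread perfectly evenly over the resolution boundaries within the scheduling period. The function names are hypothetical, and the model ignores the random clustering that a real uniform distribution produces.

```c
#include <assert.h>

/* ceil(a / b) for positive integers. */
static unsigned long ceil_div(unsigned long a, unsigned long b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/*
 * Idealised estimate of Level 0 timers used (an assumption of this
 * sketch, not the paper's model): the period is cut into slots of
 * 'resolution_s' seconds, events are spread evenly over the slots,
 * and each occupied slot needs ceil(events_per_slot / maxevents)
 * Level 0 timers.
 */
static unsigned long estimate_level0_timers(unsigned long events,
                                            unsigned long period_s,
                                            unsigned long resolution_s,
                                            unsigned long maxevents)
{
    unsigned long slots = ceil_div(period_s, resolution_s);
    unsigned long used_slots = events < slots ? events : slots;
    unsigned long per_slot;

    if (events == 0)
        return 0;
    per_slot = ceil_div(events, used_slots);
    return used_slots * ceil_div(per_slot, maxevents);
}
```

For 1,000,000 events over 3600 seconds, this gives 4068 timers at a resolution of 32 seconds with maxevents 250, and 100800 timers at a resolution of 2 seconds with maxevents 10. For sparse loads the even-spread assumption overestimates slightly, since randomly placed events occasionally share a boundary.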
During the experimental runs, special care was taken to ensure that all the modules had adequate memory. No failures were observed on the system.

In the experiment with 100 events (table I), the number of Level 0 timers used is comparable to the number of events scheduled. This is to be expected as the time scale over which the events are scheduled is quite large compared to the total number of events. The experiment
with 1,000,000 events (table V) shows impressive savings with a best case of 4074 Level 0 timers proving adequate for scheduling all the events.

TABLE II: 1,000 EVENTS (number of Level 0 timers used)

Resolution      Maxevents
(in seconds)    10      50      250
2               780     780     780
4               617     617     617
8               410     410     410
16              226     224     224
32              143     113     113

TABLE III: 10,000 EVENTS (number of Level 0 timers used)

Resolution      Maxevents
(in seconds)    10      50      250
2               1849    1797    1797
4               1391    902     902
8               1196    452     452
16              1107    269     226
32              1047    235     113

TABLE IV: 100,000 EVENTS (number of Level 0 timers used)

Resolution      Maxevents
(in seconds)    10      50      250
2               10814   3154    1849
4               10417   2586    924
8               10207   2238    475
16              10106   2114    457
32              10050   2059    454

TABLE V: 1,000,000 EVENTS (number of Level 0 timers used)

Resolution      Maxevents
(in seconds)    10      50      250
2               101029  21131   5543
4               100513  20553   4619
8               100259  20283   4244
16              100131  20141   4141
32              100058  20072   4074

Looking through a row of any of tables II, III, IV and V, it is clear that the number of Level 0 timers used decreases as the maximum number of events queued off a single timer increases. Similarly, as the coarseness increases (or resolution drops), the number of timers needed decreases. The explanation for both observations is the higher occupancy of the timer queues. (As of this writing we are unable to make available results comparing the cpu and memory usage with and without the Level 1 module. These will be included in the final version of the paper if required.)

Fig. 6. Comparison Plot (Log(timers used) against Log(events scheduled); series: Level 1 with resolution 8 seconds and maxevents 250, and Level 0 only)

In the application discussed in Section IV, a setting of 8 seconds for the ‘resolution’ and 250 for the ‘maxevents’ would be considered appropriate. Figure 6 plots the number of Level 0 timers used for this setting against the expected number of Level 0 timers if used directly. Note that the axes are log10() in order to meaningfully accommodate the large numbers from the experiments.

B. Algorithmic Analysis

Having discussed the empirical results, an algorithmic analysis of this scheme is in order. The point of interest here is the performance of the timer_schedule() and timer_deschedule() calls. As we show below, the running time of these calls largely depends on whether a Level 0 timer call is required for the operation.

We start with some terminology. Assume that the number of Level 1 nodes in the steady state is n and that the average number of Level 0 timers per Level 1 node is b. Further assume that the time taken to insert (delete) an event in the queue for a Level 0 timer is the constant Qa (Qd). The time taken to create (cancel) a Level 0 timer is dependent on the implementation of that module; we assume it is given by the function La() (Ld()).
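To make this terminology concrete, the C sketch below models only the bookkeeping of the schedule path and counts how often a Level 0 timer would have to be created, i.e. how often La() is paid. It is a simplified illustration under stated assumptions: a linear scan over a fixed array stands in for the rb-tree lookup, per-timer occupancy counters stand in for the event queues, and all names and limits (`l1_module`, `l1_schedule`, `MAX_NODES`, `MAX_TIMERS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES  64   /* illustrative limits, not from the paper */
#define MAX_TIMERS 16

/* One Level 1 node: all Level 0 timers sharing one firing time. */
struct l1_node {
    unsigned long key;                   /* firing time since module init */
    size_t        ntimers;               /* 'b' for this node */
    size_t        occupancy[MAX_TIMERS]; /* events queued per Level 0 timer */
};

struct l1_module {
    size_t         maxevents;
    size_t         nnodes;               /* 'n' */
    struct l1_node nodes[MAX_NODES];
    unsigned long  la_calls;             /* Level 0 timers created so far */
};

/* Stand-in for the rb-tree search; a real module would do this in
 * O(log(n)) rather than with a linear scan. */
static struct l1_node *find_node(struct l1_module *m, unsigned long key)
{
    for (size_t i = 0; i < m->nnodes; i++)
        if (m->nodes[i].key == key)
            return &m->nodes[i];
    return NULL;
}

/* Schedule one event at 'key'. Returns 0 on success, -1 if the
 * illustrative fixed-size tables are full. */
static int l1_schedule(struct l1_module *m, unsigned long key)
{
    struct l1_node *n = find_node(m, key);

    assert(m->maxevents > 0);
    if (n == NULL) {                     /* no matching Level 1 node yet */
        if (m->nnodes == MAX_NODES)
            return -1;
        n = &m->nodes[m->nnodes++];
        n->key = key;
        n->ntimers = 0;
    }
    /* O(b) walk: use the first Level 0 timer with spare capacity. */
    for (size_t i = 0; i < n->ntimers; i++) {
        if (n->occupancy[i] < m->maxevents) {
            n->occupancy[i]++;           /* costs only Qa */
            return 0;
        }
    }
    if (n->ntimers == MAX_TIMERS)
        return -1;
    n->occupancy[n->ntimers++] = 1;      /* overflow: new Level 0 timer */
    m->la_calls++;                       /* this is where La() is paid */
    return 0;
}
```

In this model, scheduling 1000 events at the same key with ‘maxevents’ set to 250 creates only four Level 0 timers; every other call completes without a Level 0 operation.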
A timer_schedule() call takes O(log(n)) (for the rb-tree search) + O(b) (to walk through the Level 0 nodes) + Qa (to insert the event in the queue) time, when an existing timer can accommodate the item. If there is an existing Level 1 timer but no space in any of the existing queues, the call takes O(log(n)) + O(b) (to search the tree and walk the Level 0 nodes) + La() (to create a new Level 0 timer) + Qa time. If there is no matching Level 1 timer in the tree, the call takes O(log(n)) + La() + Qa time.

timer_deschedule() calls take Qd time when the descheduled event doesn't leave behind an empty timer queue and O(b) + Qd + Ld() when the Level 1 node stays intact but a Level 0 node is to be deleted because of its queue becoming empty. Finally, if the Level 1 node itself is deleted in addition to the Level 0 node because there are no events queued at all, the running time is O(log(n)) + Qd + Ld(), the O(log(n)) coming from the delete operation on the rb-tree.

The case where a new Level 1 node is created or deleted is not treated further as the properties directly reflect those of the rb-tree, where inserts, deletes and searches all take O(log(n)) time. Also, note that ‘b’ is expected to be small in practice and the doubly linked list suffices for small ‘b’. If this is not the case for a set of applications, the list can be replaced with a red-black tree to yield O(log(b)) time instead of O(b).

La() and Ld() overshadow the other costs as acquiring or canceling a Level 0 timer minimally requires an API boundary crossing. This API boundary crossing is often an expensive system call. Note that without the Level 1 module, scheduling an event would always cost La() whereas descheduling would always cost Ld(). Hence, as the occupancy of the timer queues (figure 4) increases, the running time of the Level 1 module improves as most schedule and deschedule calls complete without requiring a Level 0 operation.

VI. RELATED WORK

Brown [1] and Davison [4] independently discovered calendar queues, which are modeled after desk calendars and can be used to implement a timer facility. Varghese et al. [18] describe a way of building scalable timer implementations using cascaded timing wheels, each of which is similar to a calendar queue. These techniques qualify as Level 0 in the hierarchy of figure 1 and have been used by multiple Unix-like operating systems to implement the timeout facility [3].

The concept of using hierarchical timers has been implicitly used at various times in the operating systems and data networks community. Most significantly, various flavors of BSD [13] implement the TCP/IP stack by layering protocol timers on top of a few kernel timers obtained from the timeout facility. The kernel timers in use by the stack fire at every tick, and the callback routine that is called walks through the active protocols, calling one function per protocol which in turn processes the protocol timers which need to be fired at that instant. Further implementation details on this can be found in [20].

Sharma et al. [16] describe a way of dynamically adjusting timers in soft state protocols in order to keep the protocol control traffic bounded. The idea of dynamic adjustments can be applied to the Level 1 timer implementation by allowing on-the-fly changes to the ‘resolution’ and ‘maxevents’ parameters which control the occupancy of the timer queues and hence the efficiency of the module.

VII. SUMMARY

In this paper we have described in detail a novel mechanism which trades off accuracy in favor of scalability. The result is a highly scalable timer module built on top of existing finer granularity timer implementations with the help of a fast access algorithm (in our implementation, a red-black tree).
This mechanism has been implemented in a commercial system where one of its applications is to damp BGP route-flaps, which in the worst case generate a load of a half to one million prefixes, i.e. events which need to be scheduled off timers.

Note that the design discussed here may not be suitable (without modification) for certain applications as it does not provide a way to directly control the jitter of the timers. For example, routing networks which can synchronize without deliberate jitter in the control messages [6] may not be built on top of the module as described here. On the other hand, jitter is a functionality typically provided by Level 0 modules and the Level 1 API can be extended to control the jitter of the lower level timers. Scale permitting, applications are of course free to use the Level 0 timers directly.

Acknowledgements

We would like to thank David Ward (IENG) and Ping Pan (Bell Labs) for discussions which started the train of thought that led to the development of this idea. We would also like to thank Bernhard Suter and Lampros Kalampoukas (Bell Labs) and Sambit Sahu (University of Massachusetts) for helpful comments on preliminary versions of this paper and Shivkumar Haran (Lucent Technologies) for helping debug the implementation.

Note: The author's employers may patent the ideas presented in this paper [5].
REFERENCES

[1] R. Brown. Calendar Queues: A Fast O(1) Priority Queue Implementation for the Simulation Event Set Problem. Communications of the ACM, 31(10), 1988.
[2] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.
[3] A. Costello and G. Varghese. Redesigning the BSD Callout and Timer Facilities. Technical Report 95-23, Washington University, St. Louis, MO, 1995.
[4] G. Davison. Calendar P's and Queues. Communications of the ACM, 32(10), 1989.
[5] R. Dube. Scalable Hierarchical Coarse-grained Timers. Patent Application.
[6] S. Floyd and V. Jacobson. The Synchronization of Periodic Routing Messages. In SIGCOMM Conference. ACM, 1993.
[7] B. Halabi. Internet Routing Architectures. Cisco-Press, 1997.
[8] J.W. Stewart III. BGP4: Inter-Domain Routing in the Internet. Addison-Wesley, 1998.
[9] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice Hall, 1988.
[10] C. Labovitz, G.R. Malan, and F. Jahanian. Internet Routing Instability. In SIGCOMM Conference. ACM, 1997.
[11] C. Labovitz, G.R. Malan, and F. Jahanian. Origins of Internet Routing Instability. In INFOCOM Conference. IEEE, 1999.
[12] Lucent Technologies. PacketStar 6400 Series IP Switch On-line User Documentation, 1999. Release 1.1.
[13] M.K. McKusick, K. Bostic, M.J. Karels, and J.R. Quarterman. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, 1996.
[14] Red Hat. Linux OS, 6.0 intel edition, 1999.
[15] Y. Rekhter and T. Li. A Border Gateway Protocol (BGP-4), March 1995. IETF RFC 1771.
[16] P. Sharma, D. Estrin, S. Floyd, and V. Jacobson. Scalable Timers for Soft State Protocols. In INFOCOM Conference. IEEE, 1997.
[17] Sun Microsystems. SunOS, 5.5.1 sparc edition, 1997.
[18] G. Varghese and A. Lauck. Hashed and Hierarchical Timing Wheels: Efficient Data Structures for Implementing a Timer Facility. IEEE/ACM Transactions on Networking, 5(6), 1997.
[19] C. Villamizar, R. Chandra, and R. Govindan. BGP Route Flap Damping, November 1998. IETF RFC 2439.
[20] G.R. Wright and W.R. Stevens. TCP/IP Illustrated, Volume 2. Addison-Wesley, 1994.