Threads Implementations

CS 167. Copyright © 2006 Thomas W. Doeppner. All rights reserved.

Outline

• Threads implementations
  – the one-level model
  – the variable-weight processes model
  – the two-level model (one kernel thread)
  – the two-level model (multiple kernel threads)
  – the scheduler-activations model
  – performance

Implementing Threads

• Components
  – chores: the work that is to be done
  – processors: the active agents
  – threads: the execution context

Before we discuss how threads are implemented, let's introduce some terms that will be helpful in discussing implementations. It's often convenient to separate the notion of the work that is being done (e.g., the computation that is to be performed) from the notion of some active agent who is doing the work. We call the former a chore and the latter a processor (the notes below also refer to these active agents as "activities"). Examples of chores include computing the value of π, looking up an entry in a database, and redrawing a window.

To model systems in sufficient detail to discuss performance, we use a further abstraction: contexts. When a processor executes code, its registers contain certain values, some of which define such things as the stack. These values must be loaded into a processor so that it can execute the code associated with handling a chore; conversely, this information must be saved if the processor is to be switched to executing the code associated with some other chore. We define a thread (or thread of control) to be this information; in other words, a thread is an execution context.

Scheduling

• Chores on threads
  – event loops
• Threads on processors
  – time-division multiplexing
    - explicit thread switching
    - time slicing

An important aspect of multithreading is scheduling threads on processors. One might think that the most important aspect is the schedule itself: when is a particular thread chosen for execution by a processor? That is certainly not unimportant, but what is perhaps more crucial is where the scheduling takes place and what sort of context is being scheduled.

The simplest system would handle a single chore in the context of a single thread executed by a single processor. This would be a rather limited system: it would perform that chore and nothing else. A simple multiprocessor system would consist of multiple chores, each handled in the context of a separate thread, each executed by a separate processor. More realistic systems handle many chores, both sequentially and concurrently, which requires some sort of multiplexing mechanism.

One might handle multiple chores with a single thread: this is the approach used in event-handling systems, in which a thread supports an event loop. In response to an event, the thread is assigned an associated chore (and the processor is directed to handle that chore). Time-division multiplexing allows chores to be handled concurrently by dividing a processor's time among a number of threads. The mechanism might be time slicing (assigning the processor to a thread for a certain amount of time before assigning it to another thread) or explicit switching (code in the chore releases the processor so that it may switch to another thread). In either case, a scheduler determines which threads should be assigned processors. A minimal event-loop sketch follows.
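
The sketch below (not from the original slides) shows the event-loop idea in C: one thread handles a queue of chores, one at a time. The chore functions and the fixed queue are purely illustrative; a real loop would block for events and dispatch as they arrive.

    #include <stdio.h>

    typedef void (*chore_t)(void *);        /* a chore: work to be done */

    static void redraw(void *arg) { printf("redrawing %s\n", (char *)arg); }
    static void lookup(void *arg) { printf("looking up %s\n", (char *)arg); }

    int main(void) {
        /* a fixed "event queue"; a real loop would wait for events with
           poll() or select() and append chores as events arrive */
        struct event { chore_t fn; void *arg; } queue[] = {
            { redraw, "window" },
            { lookup, "database entry" },
            { redraw, "window" },
        };
        int n = sizeof queue / sizeof queue[0];

        /* the event loop: a single thread handles every chore in turn */
        for (int i = 0; i < n; i++)
            queue[i].fn(queue[i].arg);
        return 0;
    }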

Multiplexing Processors

[Figure: thread states. Threads cycle among Running, Runnable, and Blocked (e.g., waiting on the keyboard or the disk); runnable threads wait to be assigned a processor.]

To be a bit more precise about scheduling, let's define some more (standard) terms. Threads are in either a blocked state or a runnable state: in the former they cannot be assigned a processor; in the latter they can. A scheduler determines which runnable threads should be assigned processors. Runnable threads that have been assigned activities are called running threads.

One-Level Model

[Figure: the one-level model. Each thread spans both a user context and a kernel context and is scheduled directly onto a processor by the kernel.]

In most systems there are actually two components of the execution context: the user context and the kernel context. The former is used when an activity is executing user code; the latter when the activity is executing kernel code (on behalf of the chore). How these contexts are manipulated is one of the more crucial aspects of a threads implementation.

The conceptually simplest approach is what is known as the one-level model: each thread consists of both contexts. A thread is scheduled onto an activity, and the activity can switch back and forth between the two types of contexts. A single scheduler in the kernel handles all the multiplexing duties. The threading implementation in Windows is (mostly) done this way.

Variable-Weight Processes

• Variant of one-level model
• Portions of parent process selectively copied into or shared with child process
• Children created using clone system call

Unlike most other Unix systems, which distinguish between processes and threads and thus allow multithreaded processes, Linux maintains the one-thread-per-process approach. However, so that multiple threads can share an address space, Linux supports the clone system call, a variant of fork, via which a new process can be created that shares resources (in particular, its address space) with its parent. The result is a variant of the one-level model.

This approach is not unique to Linux. It is used in SGI's IRIX and was first discussed in early 1989, when it was known as variable-weight processes. (See "Variable-Weight Processes with Flexible Shared Resources," by Z. Aral, J. Bloom, T. Doeppner, I. Gertner, A. Langerman, and G. Schaffer, Proceedings of the Winter 1989 USENIX Association Meeting.)

Cloning

[Figure: resources that clone can either copy into or share with the child: signal info, files (the file-descriptor table), FS info (root, cwd, umask), and virtual memory.]

As implemented in Linux, a process may be created with the clone system call (in addition to the fork system call). For each of the resources shown in the slide, one can specify whether the child gets a copy or shares the resource with the parent. Only two cases are generally used: everything is copied (equivalent to fork) or everything is shared (creating what we ordinarily call a thread, though the "thread" has a separate process ID). A sketch of the shared case follows.
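
Below is a minimal Linux C sketch of the all-shared case using the glibc clone(2) wrapper. The flag choices, stack size, and argument string are illustrative, and error handling is mostly omitted.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int counter = 0;             /* visible to the child via CLONE_VM */

    static int child_fn(void *arg) {
        counter++;                      /* updates the parent's memory directly */
        printf("child running, arg=%s\n", (char *)arg);
        return 0;
    }

    int main(void) {
        const int STACK_SIZE = 64 * 1024;
        char *stack = malloc(STACK_SIZE);   /* the child needs its own stack */
        if (stack == NULL) return 1;

        /* share memory, the fd table, and fs info; SIGCHLD lets us wait */
        int flags = CLONE_VM | CLONE_FILES | CLONE_FS | SIGCHLD;
        /* pass the top of the stack: it grows downward on, e.g., x86 */
        pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, "hello");
        if (pid == -1) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);          /* the "thread" has its own process ID */
        printf("parent sees counter=%d\n", counter);  /* prints 1: memory shared */
        free(stack);
        return 0;
    }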

Linux Threads (pre 2.6)

[Figure: the pre-2.6 LinuxThreads structure. The initial thread and the other threads communicate with a dedicated manager thread over a pipe.]

Building a POSIX-threads implementation on top of Linux's variable-weight processes requires some work. What's discussed here is the approach used prior to Linux 2.6. Some information about the threads implementation of 2.6 can be found at http://people.redhat.com/drepper/nptl-design.pdf.

Each thread is, of course, a process; all threads of the same computation share the same address space, open files, and signal handlers. One might expect the implementation of pthread_create to be a simple call to clone. Unfortunately, this wouldn't allow an easy implementation of operations such as pthread_join: a Unix process may wait only for its children to terminate, whereas a POSIX thread can join with any other joinable thread. Furthermore, if a Unix process terminates, its children are inherited by the init process (process number 1). So that pthread_join can be implemented without undue complexity, a special manager thread (actually a process) is the parent/creator of all threads other than the initial thread. The manager handles thread (process) termination via the wait4 system call and thus provides a means for implementing pthread_join. When any thread invokes pthread_create or pthread_join, it sends a request to the manager via a pipe and waits for a response; the manager handles the request and wakes up the caller when appropriate.

The state of a mutex is represented by a bit. If there are no competitors for locking a mutex, a thread simply sets the bit with a compare-and-swap instruction (allowing atomic testing and setting of the mutex's state bit). If a thread must wait for a mutex to be unlocked, it blocks using the sigsuspend system call, after queuing itself on a queue headed by the mutex. A thread unlocking a mutex wakes up the first waiting thread by sending it a Unix signal (via the kill system call). The wait queue for condition variables is implemented in a similar fashion.

On multiprocessors, for mutexes that are neither recursive nor error-checking, waiting is implemented with an adaptive strategy: under the assumption that mutexes are typically not held for long, a thread attempting to lock a locked mutex "spins" on it for up to a short period of time, i.e., it repeatedly tests the state of the mutex in hopes that it will be unlocked. If the mutex does not become available after the maximum number of tests, the thread finally blocks by queuing itself and calling sigsuspend. A simplified sketch of this strategy follows.
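
The following C sketch (an illustration, not the LinuxThreads source) shows the shape of the adaptive strategy: an atomic compare-and-swap fast path plus bounded spinning. For brevity it falls back on sched_yield rather than the real implementation's waiter queue with sigsuspend and kill, and SPIN_LIMIT is an arbitrary value.

    #include <sched.h>

    #define SPIN_LIMIT 100              /* hypothetical bound on spinning */

    typedef struct { volatile int locked; } mutex_t;

    static void mutex_lock(mutex_t *m) {
        for (;;) {
            /* fast path: atomically flip the state bit from 0 to 1 */
            for (int i = 0; i < SPIN_LIMIT; i++)
                if (__sync_bool_compare_and_swap(&m->locked, 0, 1))
                    return;
            /* slow path: give up the processor (stand-in for queuing
               on the mutex and blocking in sigsuspend) */
            sched_yield();
        }
    }

    static void mutex_unlock(mutex_t *m) {
        __sync_lock_release(&m->locked);    /* clear the bit; the real unlock
                                               would also kill() a waiter */
    }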

Two-Level Model: One Kernel Thread

[Figure: the two-level model with one kernel thread. Many user threads are multiplexed by a user-level scheduler onto a single kernel thread, which the kernel schedules onto a processor.]

Another approach, the two-level model, represents the two contexts as separate types of threads: user threads and kernel threads. Kernel threads become "virtual activities" upon which user threads are scheduled. Thus two schedulers are used: kernel threads are multiplexed on activities by a kernel scheduler; user threads are multiplexed on kernel threads by a user-level scheduler.

An extreme case of this model uses only a single kernel thread per process (perhaps because this is all the operating system supports). The Unix implementation of the Netscape web browser was based on this model (recent Solaris versions use the native Solaris implementation of threads), as were early Unix threads implementations. There are two obvious disadvantages, both resulting from the restriction to a single kernel thread per process: only one activity can be used at a time (so a single process cannot take advantage of a multiprocessor), and if the kernel thread blocks (e.g., as part of an I/O operation), no user thread can run. A sketch of user-level context switching on a single kernel thread follows.
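
To see how user-level switching can avoid the kernel entirely, here is a minimal C sketch using the standard (if now obsolescent) ucontext interface: two user contexts take turns on a single kernel thread, and each switch is an ordinary user-mode call, not a system call.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, t1_ctx, t2_ctx;
    static char stack1[64 * 1024], stack2[64 * 1024];

    static void t1_fn(void) {
        printf("user thread 1 running\n");
        swapcontext(&t1_ctx, &t2_ctx);  /* explicit switch: pure user code */
        printf("user thread 1 again\n");
        /* falling off the end resumes uc_link (the main context) */
    }

    static void t2_fn(void) {
        printf("user thread 2 running\n");
        swapcontext(&t2_ctx, &t1_ctx);  /* hand the kernel thread back */
    }

    int main(void) {
        getcontext(&t1_ctx);
        t1_ctx.uc_stack.ss_sp   = stack1;
        t1_ctx.uc_stack.ss_size = sizeof stack1;
        t1_ctx.uc_link = &main_ctx;
        makecontext(&t1_ctx, t1_fn, 0);

        getcontext(&t2_ctx);
        t2_ctx.uc_stack.ss_sp   = stack2;
        t2_ctx.uc_stack.ss_size = sizeof stack2;
        t2_ctx.uc_link = &t1_ctx;       /* if t2 returns, resume t1 */
        makecontext(&t2_ctx, t2_fn, 0);

        swapcontext(&main_ctx, &t1_ctx);    /* start user thread 1 */
        printf("back in main\n");
        return 0;
    }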

Two-Level Model: Multiple Kernel Threads

[Figure: the two-level model with multiple kernel threads. User threads are multiplexed onto several kernel threads, which the kernel schedules onto the processors.]

A more elaborate use of the two-level model allows multiple kernel threads per process. This deals with both of the disadvantages described above and is the basis of the Solaris implementation of threading. It has some performance issues; in addition, the notion of multiplexing user threads onto kernel threads is very different from the notion of multiplexing threads onto activities: there is no direct control over when a chore is actually run by an activity. From an application's perspective, it is sometimes desirable to have direct control over which chores are currently being run.

Scheduler Activations

A third approach, known historically as the scheduler-activations model, is for threads to represent user contexts, with kernel contexts supplied when needed (i.e., not as kernel threads, as in the two-level model). User threads are multiplexed on activities by a user-level scheduler, which communicates to the kernel the number of activities needed (i.e., the number of ready user threads). The kernel multiplexes entire processes on activities: it determines how many activities to give each process. This model, which is the basis for the Digital UNIX (now Tru64 UNIX) threading package, certainly gives the user application direct control over which chores are being run.

To make some sense of this, let's work through an example. A process starts up, containing a single user execution context (and user thread) and a kernel execution context (and kernel thread). Following the dictates of its scheduling policy, the kernel scheduler assigns a processor to the process. If the kernel thread blocks, the process implicitly relinquishes the processor to the kernel scheduler and gets it back once it unblocks.

Suppose the user program creates a new thread (and its associated user execution context). If actual parallelism is desired, code in the user-level library notifies the kernel that two processors are desired. When a processor becomes available, the kernel creates a new kernel execution context; using the newly available processor running in the new kernel execution context, it places an upcall (going from system code to user code, unlike a system call, which goes from user code to system code) to the user-level library, effectively giving it the processor. The library code then assigns this processor to the new thread and user execution context.

Scheduler Activations (continued)

The user application might then create another thread. It might also ask for another processor, but this machine has only two. However, suppose one of its other threads (thread 1) blocks on a page fault. The kernel, getting the processor back, creates a new kernel execution context and places another upcall to our process, telling it two things:

• The thread using kernel execution context 1 has blocked, and thus it has lost its processor (processor 1).
• Here is processor 1; can you use it?

In our case the process will assign the processor to thread 3. But soon the page being waited for by thread 1 becomes available. The kernel should notify the process of this event, but, of course, it requires a processor to do so. So it uses one of the processors already assigned to the process, the one the process has assigned to thread 2. The process is now notified of the following two events:

• The thread using kernel execution context 1 has unblocked (i.e., it would be running, if only it had a processor).
• I'm telling you this using processor 2, which I've taken from the thread that was using kernel execution context 2.

The library now must decide what to do with the processor that has been handed to it. It could give it back to thread 2, leaving thread 1 unblocked, but not running, in the kernel; it could continue the suspension of thread 2 and give the processor to thread 1; or it could decide that both threads 1 and 2 should be running now and thus suspend thread 3, give thread 3's processor to thread 1, and give thread 2 its processor back.

Scheduler Activations (still continued)

At some point the kernel is going to decide that the process has had one or both processors long enough (e.g., a time slice has expired). So it yanks one of the processors away and, using the other processor, makes an upcall conveying the following news:

• I've taken processor 1.
• I'm telling you this using processor 2.

The library learns that it now has only one processor, but with this knowledge it can assign that processor to the most deserving thread.

Performance

• One-level model
  – operations on threads are expensive (require system calls)
  – example: mutual exclusion in Windows
    - critical section implemented partly in user mode
      • success case in user code
    - mutex implemented completely in kernel
    - success case is 20 times faster for critical section than for mutex

The one-level model is the most straightforward: unlike the others, it has but a single scheduler. This scheduler resides in the kernel; hence the bulk of the data structures and code required to represent and manipulate threads is in the kernel (though not all, as we discuss below). Thus many thread operations, such as synchronization, thread creation, and thread destruction, involve calls to kernel code, i.e., system calls. Since in most architectures such calls (from user code) are significantly more expensive than calls to user code, threading implementations based on the one-level model are prone to high operation costs.

To illustrate the performance penalty incurred when performing a system call, we measured the costs of performing an operation both in user space and in the kernel in Windows NT 4.0. The operation, waiting for an object to become unlocked and then locking it, is performed frequently by numerous applications and has a highly optimized implementation, especially for the case we exercised, in which the object is not already locked. NT provides two constructs for doing this: the critical section and the mutex. The former is implemented partly in user code and partly in kernel code; if the object in question (represented by the critical section) is not locked, the critical section operates strictly in user mode. The mutex is implemented entirely in kernel code: regardless of its state, operations on it involve system calls. Our measurements show that requests to lock, then unlock, a mutex take twenty times longer than to lock and unlock a critical section when neither is already locked. (Operations on the two take the same amount of time when the mutex or critical section is already locked by another thread.) A sketch of such a measurement follows.
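
Below is a C sketch of the uncontended comparison using the Win32 API (compile with a Windows toolchain). The iteration count is arbitrary and GetTickCount is coarse, so treat the resulting numbers as illustrative only.

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };
        CRITICAL_SECTION cs;
        InitializeCriticalSection(&cs);
        HANDLE mtx = CreateMutex(NULL, FALSE, NULL);

        DWORD t0 = GetTickCount();
        for (int i = 0; i < N; i++) {   /* user-mode fast path */
            EnterCriticalSection(&cs);
            LeaveCriticalSection(&cs);
        }
        DWORD t1 = GetTickCount();
        for (int i = 0; i < N; i++) {   /* system call on every operation */
            WaitForSingleObject(mtx, INFINITE);
            ReleaseMutex(mtx);
        }
        DWORD t2 = GetTickCount();

        printf("critical section: %lu ms, mutex: %lu ms\n",
               (unsigned long)(t1 - t0), (unsigned long)(t2 - t1));
        DeleteCriticalSection(&cs);
        CloseHandle(mtx);
        return 0;
    }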

Performance (continued)

• Two-level model (good news)
  – many operations on threads are done strictly in user code: no system calls

The two-level model makes it possible to eliminate some of the overhead of the one-level model. Since user threads are multiplexed on kernel threads, the thread directly manipulated by application code is the user thread, implemented entirely in user-level code. As long as operations on user threads do not involve operations on kernel threads, all execution takes place at user level, and the cost of calling kernel code is avoided. The trick, of course, is to avoid operations on kernel threads.

The user-level library maintains a ready list of runnable user threads. When a running user thread must block for synchronization, it is put on a wait queue and its kernel thread switches to run the user thread at the head of the ready list; if the list is empty, the kernel thread executes a system call to block. When one user thread unblocks another, the latter is moved to the end of the ready list. If kernel threads are available (and thus the ready list was empty), a system call is required to wake up a kernel thread so that it can run the unblocked user thread. Thus operations on user threads induce (expensive) operations on kernel threads when there is a surplus of kernel threads.

Performance (still continued)

• Two-level model (not-so-good news)
  – if not enough kernel threads, deadlock is possible
    - Solaris automatically creates a new kernel thread if all are blocked

Runtime decisions must be made about kernel threads: how many should there be, and when should they be created? With the single-kernel-thread version of the model these questions are answered trivially; coping with them in the general model is a very important aspect of an implementation.

One concern is deadlock. Suppose a process has two chores being handled by two user threads but just one kernel thread. The kernel thread has been assigned to one of the user threads and is blocked. There is code that could be executed in the other user thread that would unblock the first, but since no kernel thread is available, this code will never be executed: both user threads (and the kernel thread) are blocked forever. (This could happen on a Unix system, for example, if the two user threads are communicating via a pipe and the first is blocked on a read, waiting for the other to do a write; a sketch follows.)

In Solaris, this problem is prevented with the aid of the operating system. If the OS detects that all of a process's kernel threads are blocked, it notifies the user-level threads code, which creates a new kernel thread if there are runnable user threads.
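
Here is the pipe scenario as a C sketch. On any modern one-to-one pthreads implementation it completes normally; under a two-level library with a single kernel thread, the read would block the only kernel thread, the writer could never run, and the process would deadlock.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fds[2];              /* fds[0] is the read end, fds[1] the write end */

    static void *reader(void *arg) {
        (void)arg;
        char c;
        read(fds[0], &c, 1);        /* blocks in the kernel until data arrives */
        printf("reader got '%c'\n", c);
        return NULL;
    }

    static void *writer(void *arg) {
        (void)arg;
        write(fds[1], "x", 1);      /* this write is what unblocks the reader */
        return NULL;
    }

    int main(void) {
        pthread_t r, w;
        if (pipe(fds) == -1) return 1;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(r, NULL);
        pthread_join(w, NULL);
        return 0;
    }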

Performance (yet still continued)

• Two-level model (more bad news)
  – loss of parallelism if not enough kernel threads
    - use pthread_setconcurrency in Solaris
  – excessive overhead if too many kernel threads

What if there are not enough kernel threads? As discussed on the previous page, the Solaris kernel ensures that there are enough kernel threads to prevent deadlock. However, we might have a situation in which there are two processors and two kernel threads. One kernel thread is blocked (its user thread is waiting on I/O); the other is running (its user thread is in a compute loop). If there is another runnable user thread, it won't be able to run until a kernel thread becomes available, even though there is an available processor. (A new kernel thread won't automatically be created, since that happens only when all of a process's kernel threads are blocked.) One overcomes this problem in Solaris by using the pthread_setconcurrency routine to set a lower bound on the number of kernel threads used by a process, as sketched below.

What if there are too many kernel threads? One result would be that at times there might be more ready kernel threads than activities, and thus the threads' execution would be time-sliced. Unless there are vastly too many threads, this causes no noticeable problems. However, there are more subtle issues. For example, suppose two user threads are each handling a separate chore, with synchronization constructs used to alternate the threads' executions and ensure that they never run simultaneously. If we use just a single kernel thread, the synchronization is handled entirely in user space: first one user thread runs, then joins a wait queue; the kernel thread runs the other user thread, which soon releases the first user thread and joins a wait queue itself, and so forth. The kernel thread alternates running the two user threads and execution never enters the kernel.
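
A minimal sketch of the remedy follows; pthread_setconcurrency is a real (though now obsolescent) POSIX routine, and the value 4 is arbitrary. On a one-to-one implementation the call is accepted but has no practical effect.

    #define _XOPEN_SOURCE 600       /* exposes pthread_setconcurrency on glibc */
    #include <pthread.h>
    #include <stdio.h>

    int main(void) {
        /* hint to the library: keep at least four kernel threads */
        int err = pthread_setconcurrency(4);
        if (err != 0)
            fprintf(stderr, "pthread_setconcurrency failed: %d\n", err);
        printf("concurrency level: %d\n", pthread_getconcurrency());
        /* ... create user threads with pthread_create as usual ... */
        return 0;
    }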

Performance (notes continued)

• Two-level model (more bad news)
  – loss of parallelism if not enough kernel threads
    - use pthread_setconcurrency in Solaris
  – excessive overhead if too many kernel threads

Now suppose we add another kernel thread. When one user thread is released from waiting, since a kernel thread is available, that kernel thread is woken up (via a system call) to run the waking user thread. When the first user thread subsequently blocks, its kernel thread has nothing to do and must perform a system call to block in the kernel. Thus each user thread runs on a separate kernel thread, and system calls are required to repeatedly block and release the kernel threads.

We performed exactly this experiment on Solaris 2.6. Two user threads alternated their execution one million times, using semaphores for synchronization. The total running time was 24.6 seconds when one kernel thread was used, but 68.5 seconds when two kernel threads were used: a slowdown of almost a factor of three. A sketch of the experiment follows.
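
The following C sketch reconstructs the flavor of that experiment with POSIX semaphores; it is not the original benchmark code. ITERATIONS matches the one million hand-offs described above.

    #include <pthread.h>
    #include <semaphore.h>

    #define ITERATIONS 1000000

    static sem_t turn_a, turn_b;        /* track whose turn it is */

    static void *thread_a(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            sem_wait(&turn_a);          /* wait for my turn */
            sem_post(&turn_b);          /* hand off to the other thread */
        }
        return NULL;
    }

    static void *thread_b(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            sem_wait(&turn_b);
            sem_post(&turn_a);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        sem_init(&turn_a, 0, 1);        /* thread A goes first */
        sem_init(&turn_b, 0, 0);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

With one kernel thread, every hand-off is a user-level switch; with two, every hand-off requires kernel-level blocking and wakeup, which is the source of the slowdown reported above.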

Performance (final word)

• Scheduler activations model
  – no problems with too few or too many kernel threads
    - (it doesn't have any)

The scheduler-activations model, since it has no kernel threads at all, clearly has no problems resulting from having either too many or too few of them. Since threads are represented entirely in user space, operations on them are relatively cheap. The kernel, knowing exactly how many ready threads each process has, ensures that no activity is needlessly idle.
