kse(2)
NAME
kse - kernel support for user threads
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <sys/types.h>
#include <sys/kse.h>

int kse_create(struct kse_mailbox *mbx, int newgroup);
int kse_exit(void);
int kse_release(struct timespec *timeout);
int kse_switchin(mcontext_t *mcp, long val, long *loc);
int kse_thr_interrupt(struct kse_thr_mailbox *tmbx);
int kse_wakeup(struct kse_mailbox *mbx);
DESCRIPTION
These system calls implement kernel support for multi-threaded processes.

Overview
Traditionally, user threading has been implemented in one of two ways:
either all threads are managed in user space and the kernel is unaware of
any threading (also known as ``N to 1''), or else separate processes
sharing a common memory space are created for each thread (also known as
``N to N''). These approaches have advantages and disadvantages:

User threading                    Kernel threading
+ Lightweight                     - Heavyweight
+ User controls scheduling        - Kernel controls scheduling
- Syscalls must be wrapped        + No syscall wrapping required
- Cannot utilize multiple CPUs    + Can utilize multiple CPUs

The KSE system is a hybrid approach that achieves the advantages of both
the user and kernel threading approaches. The underlying philosophy of
the KSE system is to give kernel support for user threading without
taking away any of the user threading library's ability to make
scheduling decisions. A kernel-to-user upcall mechanism is used to pass
control to the user threading library whenever a scheduling decision
needs to be made. An arbitrary number of user threads are multiplexed
onto a fixed number of virtual CPUs supplied by the kernel. This can be
thought of as an ``N to M'' threading scheme.

Some general implications of this approach include:

o   The user process can run multiple threads simultaneously on
    multi-processor machines. The kernel grants the process virtual
    CPUs to schedule as it wishes; these may run concurrently on real
    CPUs.

o   All operations that block in the kernel become asynchronous,
    allowing the user process to schedule another thread when any
    thread blocks.

o   Multiple thread schedulers within the same process are possible,
    and they may operate independently of each other.
Definitions
KSE allows a user process to have multiple threads of execution in
existence at the same time, some of which may be blocked in the kernel
while others may be executing or blocked in user space. A kernel
scheduling entity (KSE) is a ``virtual CPU'' granted to the process for
the purpose of executing threads. A thread that is currently executing
is always associated with exactly one KSE, whether executing in user
space or in the kernel. The KSE is said to be assigned to the thread.

The KSE becomes unassigned, and the associated thread is suspended, when
the KSE has an associated mailbox (see below), the thread has an
associated thread mailbox (also see below), and any of the following
occurs:

o   The thread invokes a system call that blocks.

o   The thread makes any other demand of the kernel that cannot be
    immediately satisfied, e.g., touches a page of memory that needs to
    be fetched from disk, causing a page fault.

o   Another thread that was previously blocked in the kernel completes
    its work in the kernel (or is interrupted) and becomes ready to
    return to user space, and the current thread is returning to user
    space.

o   A signal is delivered to the process, and this KSE is chosen to
    deliver it.
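
The conditions above can be restated as a small predicate. This is only an
illustrative sketch: the enum, its constants, and the function name are
hypothetical and do not appear in the KSE headers or in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event names for this sketch; not kernel identifiers. */
enum sketch_event {
    EV_NONE,              /* nothing that needs a scheduling decision */
    EV_BLOCKING_SYSCALL,  /* thread invoked a system call that blocks */
    EV_PAGE_FAULT,        /* demand the kernel cannot satisfy at once */
    EV_THREAD_COMPLETED,  /* another thread finished in the kernel while
                             the current thread returns to user space */
    EV_SIGNAL             /* this KSE was chosen to deliver a signal */
};

/*
 * Returns true when, per the rules listed above, the KSE would become
 * unassigned. Both mailboxes must be present; without them the KSE
 * stays assigned to its thread regardless of the event.
 */
static bool
sketch_kse_unassigns(bool kse_has_mailbox, bool thread_has_mailbox,
    enum sketch_event ev)
{
    assert(ev >= EV_NONE && ev <= EV_SIGNAL);
    if (!kse_has_mailbox || !thread_has_mailbox)
        return false;
    return ev != EV_NONE;
}
```

For example, a blocking system call made by a thread whose KSE has no
mailbox leaves the KSE assigned, which is the traditional unthreaded
behavior.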

In other words, as soon as there is a scheduling decision to be made, the
KSE becomes unassigned, because the kernel does not presume to know how
the process' other runnable threads should be scheduled. Unassigned KSEs
always return to user space as soon as possible via the upcall mechanism
(described below), allowing the user process to decide how that KSE
should be utilized next. KSEs always complete as much work as possible
in the kernel before becoming unassigned.

A KSE group is a collection of KSEs that are scheduled uniformly and
which share access to the same pool of threads, which are associated with
the KSE group. A KSE group is the smallest entity to which a kernel
scheduling priority may be assigned. For the purposes of process
scheduling and accounting, each KSE group counts similarly to a
traditional unthreaded process. Individual KSEs within a KSE group are
effectively indistinguishable, and any KSE in a KSE group may be assigned
by the kernel to any runnable (in the kernel) thread associated with that
KSE group. In practice, the kernel attempts to preserve the affinity
between threads and actual CPUs to optimize cache behavior, but this is
invisible to the user process. (Affinity is not yet implemented.)

Each KSE has a unique KSE mailbox supplied by the user process. A
mailbox consists of a control structure containing a pointer to an upcall
function and a user stack. The KSE invokes this function whenever it
becomes unassigned. The kernel updates this structure with information
about threads that have become runnable and signals that have been
delivered before each upcall. Upcalls may be temporarily blocked by the
user thread scheduling code during critical sections.

Each user thread has a unique thread mailbox as well. Threads are
referred to using pointers to these mailboxes when communicating between
the kernel and the user thread scheduler. Each KSE's mailbox contains a
pointer to the mailbox of the user thread that the KSE is currently
executing. This pointer is saved when the thread blocks in the kernel.

Whenever a thread blocked in the kernel is ready to return to user space,
it is added to the KSE group's list of completed threads. This list is
presented to the user code at the next upcall as a linked list of thread
mailboxes.

There is a kernel-imposed limit on the number of threads in a KSE group
that may be simultaneously blocked in the kernel (this number is not
currently visible to the user). When this limit is reached, upcalls are
blocked and no work is performed for the KSE group until one of the
threads completes (or a signal is received).

Managing KSEs
To become multi-threaded, a process must first invoke kse_create(). The
kse_create() system call creates a new KSE (except for the very first
invocation; see below). The KSE will be associated with the mailbox
pointed to by mbx. If newgroup is non-zero, a new KSE group is also
created containing the KSE. Otherwise, the new KSE is added to the
current KSE group. Newly created KSEs are initially unassigned;
therefore, they will upcall immediately.

Each process initially has a single KSE in a single KSE group executing a
single user thread. Since the KSE does not have an associated mailbox,
it must remain assigned to the thread and does not perform any upcalls.
The result is the traditional, unthreaded mode of operation. Therefore,
as a special case, the first call to kse_create() by this initial thread
with newgroup equal to zero does not create a new KSE; instead, it simply
associates the current KSE with the supplied KSE mailbox, and no
immediate upcall results. However, an upcall will be triggered the next
time the thread blocks and the required conditions are met.

The kernel does not allow more KSEs to exist in a KSE group than the
number of physical CPUs in the system (this number is available as the
sysctl(3) variable hw.ncpu). Having more KSEs than CPUs would not add
any value to the user process, as the additional KSEs would just compete
with each other for access to the real CPUs. Since the extra KSEs would
always be side-lined, the result to the application would be exactly the
same as having fewer KSEs. There may however be arbitrarily many user
threads, and it is up to the user thread scheduler to handle mapping the
application's user threads onto the available KSEs.

The kse_exit() system call causes the KSE assigned to the currently
running thread to be destroyed. If this KSE is the last one in the KSE
group, there must be no remaining threads associated with the KSE group
blocked in the kernel. This system call does not return unless there is
an error.

As a special case, if the last remaining KSE in the last remaining KSE
group invokes this system call, then the KSE is not destroyed; instead,
the KSE just loses the association with its mailbox and kse_exit()
returns normally. This returns the process to its original, unthreaded
state.

The kse_release() system call is used to ``park'' the KSE assigned to the
currently running thread when it is not needed, e.g., when there are more
available KSEs than runnable user threads. The thread converts to an
upcall but does not get scheduled until there is a new reason to do so,
e.g., a previously blocked thread becomes runnable, or the timeout
expires. If successful, kse_release() does not return to the caller.
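
The bookkeeping rules for kse_create() and kse_exit() described above can
be mimicked by a toy user-space model. The following is a sketch only: it
makes no system calls, covers just a single KSE group with newgroup equal
to zero, ignores threads blocked in the kernel, and every name beginning
with sketch_ is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of one KSE group; NOT the kernel's data structure. */
struct sketch_group {
    int  ncpu;         /* stands in for the hw.ncpu sysctl value */
    int  nkse;         /* KSEs currently in the group */
    bool has_mailbox;  /* false means traditional, unthreaded mode */
};

static void
sketch_group_init(struct sketch_group *g, int ncpu)
{
    assert(ncpu >= 1);
    g->ncpu = ncpu;
    g->nkse = 1;             /* every process starts with one KSE */
    g->has_mailbox = false;
}

/* Models kse_create(mbx, 0); returns 0 or an errno-style code. */
static int
sketch_kse_create(struct sketch_group *g)
{
    if (!g->has_mailbox) {
        /* First call: attach the mailbox, create no new KSE. */
        g->has_mailbox = true;
        return 0;
    }
    if (g->nkse >= g->ncpu)
        return ENXIO;        /* already as many KSEs as CPUs */
    g->nkse++;
    return 0;
}

/* Models kse_exit() for a process whose only group is g. */
static int
sketch_kse_exit(struct sketch_group *g)
{
    if (!g->has_mailbox)
        return ESRCH;        /* unthreaded mode: no mailbox */
    if (g->nkse == 1) {
        /* Last KSE: not destroyed, it only drops its mailbox. */
        g->has_mailbox = false;
        return 0;
    }
    g->nkse--;               /* the real call would not return here */
    return 0;
}
```

With ncpu set to 2, the first modeled kse_create() leaves the KSE count at
one, the second raises it to two, and a third fails with ENXIO.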

The kse_switchin() system call can be used by the UTS, when it has
selected a new thread, to switch to the context of that thread. The use
of kse_switchin() is machine dependent. Some platforms do not need a
system call to switch to a new context, while others require its use in
particular cases.

The kse_wakeup() system call is the opposite of kse_release(). It causes
the (parked) KSE associated with the mailbox pointed to by mbx to be
woken up, causing it to upcall. If the KSE has already woken up for
another reason, this system call has no effect. The mbx argument may be
NULL to specify ``any KSE in the current KSE group''.

The kse_thr_interrupt() system call is used to interrupt a currently
blocked thread. The thread must either be blocked in the kernel or
assigned to a KSE (i.e., executing). The thread is then marked as
interrupted. As soon as the thread invokes an interruptible system call
(or immediately for threads already blocked in one), the thread will be
made runnable again, even though the kernel operation may not have
completed. The effect on the interrupted system call is the same as if
it had been interrupted by a signal; typically this means an error is
returned with errno set to EINTR.

Signals
The current implementation creates a special signal thread. Kernel
threads (KSEs) in a process mask all signals, and only the signal thread
waits for signals to be delivered to the process. The signal thread is
responsible for dispatching signals to user threads.

A downside of this is that if a multiplexed thread calls the execve()
syscall, its signal mask and pending signals may not be available in the
kernel. They are stored in userland and the kernel does not know where
to get them; however, POSIX requires them to be restored and passed to
the new process. Just setting the mask for the thread before calling
execve() is only an approximate solution to the problem, as it does not
redeliver back to the kernel any pending signals that the old process may
have blocked, and it allows a window in which new signals may be
delivered to the process between the setting of the mask and the
execve().

For now this problem has been solved by adding a special combined
kse_thr_interrupt()/execve() mode to the kse_thr_interrupt() syscall.
The kse_thr_interrupt() syscall has a sub command, KSE_INTR_EXECVE, that
allows it to accept a kse_execv_args structure, adjust the signals, and
then atomically convert into an execve() call. Additional pending
signals and the correct signal mask can be passed to the kernel in this
way. The thread library overrides the execve() syscall and translates it
into a kse_thr_interrupt() call, allowing a multiplexed thread to restore
pending signals and the correct signal mask before doing the exec().
This solution to the problem may change.

KSE Mailboxes
Each KSE has a unique mailbox for user-kernel communication defined in
<sys/kse.h>. Some of the fields there are:

km_version describes the version of this structure and must be equal to
KSE_VER_0. km_udata is an opaque pointer ignored by the kernel.

km_func points to the KSE's upcall function; it will be invoked using
km_stack, which must remain valid for the lifetime of the KSE.

km_curthread always points to the thread that is currently assigned to
this KSE if any, or NULL otherwise. This field is modified by both the
kernel and the user process as follows.

When km_curthread is not NULL, it is assumed to be pointing at the
mailbox for the currently executing thread, and the KSE may be
unassigned, e.g., if the thread blocks in the kernel. The kernel will
then save the contents of km_curthread with the blocked thread, set
km_curthread to NULL, and upcall to invoke km_func().

When km_curthread is NULL, the kernel will never perform any upcalls with
this KSE; in other words, the KSE remains assigned to the thread even if
it blocks. km_curthread must be NULL while the KSE is executing critical
user thread scheduler code that would be disrupted by an intervening
upcall; in particular, while km_func() itself is executing.

Before invoking km_func() in any upcall, the kernel always sets
km_curthread to NULL. Once the user thread scheduler has chosen a new
thread to run, it should point km_curthread at the thread's mailbox,
re-enabling upcalls, and then resume the thread. Note: modification of
km_curthread by the user thread scheduler must be atomic with the loading
of the context of the new thread, to avoid the situation where the thread
context area may be modified by a blocking async operation, while there
is still valid information to be read out of it.

km_completed points to a linked list of user threads that have completed
their work in the kernel since the last upcall. The user thread
scheduler should put these threads back into its own runnable queue.
Each thread in a KSE group that completes a kernel operation (synchronous
or asynchronous) that results in an upcall is guaranteed to be linked
into exactly one KSE's km_completed list; which KSE in the group,
however, is indeterminate. Furthermore, the completion will be reported
in only one upcall.

km_sigscaught contains the list of signals caught by this process since
the previous upcall to any KSE in the process. As long as there exists
one or more KSEs with an associated mailbox in the user process, signals
are delivered this way rather than the traditional way. (This has not
been implemented and may change.)

km_timeofday is set by the kernel to the current system time before
performing each upcall.
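
The handling of km_completed and km_curthread inside an upcall might look
roughly like the sketch below. It uses stand-in structures that carry
only fields named in this manual page, so that it compiles without
<sys/kse.h>; the sketch_ names and the run queue are hypothetical. For
brevity the run queue reuses tm_next as its link, and the required
atomicity between setting km_curthread and loading the thread context is
not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins holding only fields discussed above; the real structures
 * are defined in <sys/kse.h> and contain more members. */
struct sketch_thr_mailbox {
    struct sketch_thr_mailbox *tm_next;   /* link in km_completed */
    void *tm_udata;                       /* opaque to the kernel */
};

struct sketch_kse_mailbox {
    struct sketch_thr_mailbox *km_curthread; /* NULL blocks upcalls */
    struct sketch_thr_mailbox *km_completed; /* done in the kernel */
};

/* Move every completed thread onto the scheduler's run queue (a simple
 * LIFO list). Returns the number of threads moved. */
static int
sketch_drain_completed(struct sketch_kse_mailbox *km,
    struct sketch_thr_mailbox **runq)
{
    int n = 0;

    assert(km->km_curthread == NULL);  /* true inside an upcall */
    while (km->km_completed != NULL) {
        struct sketch_thr_mailbox *t = km->km_completed;

        km->km_completed = t->tm_next;
        t->tm_next = *runq;
        *runq = t;
        n++;
    }
    return n;
}

/* Take the next runnable thread and publish it in km_curthread, which
 * re-enables upcalls. Returns NULL if nothing is runnable. */
static struct sketch_thr_mailbox *
sketch_pick_next(struct sketch_kse_mailbox *km,
    struct sketch_thr_mailbox **runq)
{
    struct sketch_thr_mailbox *t = *runq;

    if (t == NULL)
        return NULL;                   /* caller might park the KSE */
    *runq = t->tm_next;
    t->tm_next = NULL;
    km->km_curthread = t;
    return t;
}
```

A real user thread scheduler would resume the chosen thread from its
saved context immediately after the assignment to km_curthread, and would
consider kse_release() when sketch_pick_next() finds nothing to run.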

km_flags may contain any of the following bits OR'ed together:

KMF_NOUPCALL
        Block upcalls from happening. The thread is in some critical
        section.

KMF_NOCOMPLETED, KMF_DONE, KMF_BOUND
        This thread should be considered to be permanently bound to its
        KSE, and treated much like a non-threaded process would be. It
        is a ``long term'' version of KMF_NOUPCALL in some ways.

KMF_WAITSIGEVENT
        Implement characteristics needed for the signal delivery thread.

Thread Mailboxes
Each user thread must have associated with it a unique struct
kse_thr_mailbox as defined in <sys/kse.h>. It includes the following
fields.

tm_udata is an opaque pointer ignored by the kernel.

tm_context stores the context for the thread when the thread is blocked
in user space. This field is also updated by the kernel before a
completed thread is returned to the user thread scheduler via
km_completed.

tm_next links the km_completed threads together when returned by the
kernel with an upcall. The end of the list is marked with a NULL
pointer.

tm_uticks and tm_sticks are time counters for user mode and kernel mode
execution, respectively. These counters count ticks of the statistics
clock (see clocks(7)). While any thread is actively executing in the
kernel, the corresponding tm_sticks counter is incremented. While any
KSE is executing in user space and that KSE's km_curthread pointer is not
equal to NULL, the corresponding tm_uticks counter is incremented.
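
A user thread scheduler could turn these tick counters into more familiar
units along the following lines. This is a sketch under assumptions: the
32-bit counter width and the helper names are illustrative, and the
statistics clock frequency (stathz) has to be obtained by the caller from
elsewhere, since the thread mailbox does not carry it.

```c
#include <assert.h>
#include <stdint.h>

/* Convert statistics-clock ticks to milliseconds. stathz is the clock
 * frequency in Hz, supplied by the caller (assumption of this sketch). */
static uint64_t
sketch_ticks_to_ms(uint32_t ticks, uint32_t stathz)
{
    assert(stathz > 0);
    return ((uint64_t)ticks * 1000u) / stathz;
}

/* Percentage of a thread's accounted time spent in user mode, or 0 if
 * no ticks have been recorded yet. */
static unsigned
sketch_user_percent(uint32_t uticks, uint32_t sticks)
{
    uint64_t total = (uint64_t)uticks + sticks;

    if (total == 0)
        return 0;
    return (unsigned)(((uint64_t)uticks * 100u) / total);
}
```

The widening to 64 bits before multiplying avoids overflow for large tick
counts.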

tm_flags may contain any of the following bits OR'ed together:

TMF_NOUPCALL
        Similar to KMF_NOUPCALL. This flag inhibits upcalling for
        critical sections. Some architectures require this to be in one
        place and some in the other.
RETURN VALUES
The kse_create(), kse_wakeup(), and kse_thr_interrupt() system calls
return zero if successful. The kse_exit() and kse_release() system calls
do not return if successful.

All of these system calls return a non-zero error code in case of an
error.
ERRORS
The kse_create() system call will fail if:

[ENXIO]     There are already as many KSEs in the KSE group as hardware
            processors.

[EAGAIN]    The system-imposed limit on the total number of KSE groups
            under execution would be exceeded. The limit is given by
            the sysctl(3) MIB variable KERN_MAXPROC. (The limit is
            actually ten less than this except for the super user.)

[EAGAIN]    The user is not the super user, and the system-imposed limit
            on the total number of KSE groups under execution by a
            single user would be exceeded. The limit is given by the
            sysctl(3) MIB variable KERN_MAXPROCPERUID.

[EAGAIN]    The user is not the super user, and the soft resource limit
            corresponding to the resource argument RLIMIT_NPROC would be
            exceeded (see getrlimit(2)).

[EFAULT]    The mbx argument points to an address which is not a valid
            part of the process address space.

The kse_exit() system call will fail if:

[EDEADLK]   The current KSE is the last in its KSE group and there are
            still one or more threads associated with the KSE group
            blocked in the kernel.

[ESRCH]     The current KSE has no associated mailbox, i.e., the process
            is operating in traditional, unthreaded mode (in this case
            use _exit(2) to exit the process).

The kse_release() system call will fail if:

[ESRCH]     The current KSE has no associated mailbox, i.e., the process
            is operating in traditional, unthreaded mode.

The kse_wakeup() system call will fail if:

[ESRCH]     The mbx argument is not NULL and the mailbox pointed to by
            mbx is not associated with any KSE in the process.

[ESRCH]     The mbx argument is NULL and the current KSE has no
            associated mailbox, i.e., the process is operating in
            traditional, unthreaded mode.

The kse_thr_interrupt() system call will fail if:

[ESRCH]     The thread corresponding to tmbx is neither currently
            assigned to any KSE in the process nor blocked in the
            kernel.
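
A caller might classify the kse_create() error codes listed above as in
the following sketch. The enum, its constants, and the suggested reactions
are hypothetical and reflect one possible policy, not behavior required by
the KSE interface; only the errno constants come from <errno.h>.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reactions a thread library could choose from. */
enum sketch_action {
    ACT_PROCEED,          /* call succeeded */
    ACT_USE_FEWER_KSES,   /* ENXIO: group already has one KSE per CPU */
    ACT_RETRY_LATER,      /* EAGAIN: a system or user limit was hit */
    ACT_FIX_ARGUMENT,     /* EFAULT: mbx is not a valid address */
    ACT_GIVE_UP           /* anything not documented for kse_create() */
};

/* Map a kse_create() error code (0 on success) to a reaction. */
static enum sketch_action
sketch_kse_create_action(int error)
{
    assert(error >= 0);
    switch (error) {
    case 0:
        return ACT_PROCEED;
    case ENXIO:
        return ACT_USE_FEWER_KSES;
    case EAGAIN:
        return ACT_RETRY_LATER;
    case EFAULT:
        return ACT_FIX_ARGUMENT;
    default:
        return ACT_GIVE_UP;
    }
}
```

Treating ENXIO as a signal to schedule on the existing KSEs follows from
the observation that extra KSEs beyond the CPU count add no value.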
SEE ALSO
rfork(2), pthread(3), ucontext(3)

Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M.
Levy, "Scheduler activations: effective kernel support for the user-level
management of parallelism", ACM Press, ACM Transactions on Computer
Systems, Issue 1, Volume 10, pp. 53-79, February 1992.
HISTORY
The KSE system calls first appeared in FreeBSD 5.0.
AUTHORS
KSE was originally implemented by Julian Elischer <julian@FreeBSD.org>,
with additional contributions by Jonathan Mini <mini@FreeBSD.org>, Daniel
Eischen <deischen@FreeBSD.org>, and David Xu <davidxu@FreeBSD.org>.

This manual page was written by Archie Cobbs <archie@FreeBSD.org>.
BUGS
The KSE code is currently under development.

BSD                            September 10, 2002