buf(9)

NAME

buf - kernel buffer I/O scheme used in FreeBSD VM system

DESCRIPTION

The kernel implements a KVM abstraction of the buffer cache which allows
it to map potentially disparate vm_page's into contiguous KVM for use by
(mainly file system) devices and device I/O. This abstraction supports
block sizes from DEV_BSIZE (usually 512) to upwards of several pages or
more. It also supports a relatively primitive byte-granular valid range
and dirty range currently hardcoded for use by NFS. The code
implementing the VM Buffer abstraction is mostly concentrated in
/usr/src/sys/kern/vfs_bio.c.

One of the most important things to remember when dealing with buffer
pointers (struct buf) is that the underlying pages are mapped directly
from the buffer cache. No data copying occurs in the scheme proper,
though some file systems such as UFS do have to copy a little when
dealing with file fragments. The second most important thing to
remember is that due to the underlying page mapping, the b_data base
pointer in a buf is always *page* aligned, not *block* aligned. When you
have a VM buffer representing some b_offset and b_size, the actual start
of the buffer is (b_data + (b_offset & PAGE_MASK)) and not just b_data.
Finally, the VM system's core buffer cache supports valid and dirty bits
(m->valid, m->dirty) for pages in DEV_BSIZE chunks. Thus a platform with
a hardware page size of 4096 bytes has 8 valid and 8 dirty bits. These
bits are generally set and cleared in groups based on the device block
size of the device backing the page. A complete page's worth of bits is
often referred to using the VM_PAGE_BITS_ALL bitmask (i.e., 0xFF if the
hardware page size is 4096).
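
The address and bitmask arithmetic described above can be sketched in
user space. This is only an illustrative model, not kernel code: the
ex_ names, the fixed 512-byte and 4096-byte sizes, and the reduced
struct are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the kernel's values are machine-dependent. */
#define EX_DEV_BSIZE     512u
#define EX_PAGE_SIZE     4096u
#define EX_PAGE_MASK     (EX_PAGE_SIZE - 1u)
#define EX_PAGE_BITS_ALL ((1u << (EX_PAGE_SIZE / EX_DEV_BSIZE)) - 1u)

/* Hypothetical stand-in holding only the two fields discussed above. */
struct ex_buf {
	char    *b_data;   /* page-aligned base of the mapping */
	int64_t  b_offset; /* byte offset represented by the buffer */
};

/* Actual start of the buffer data: b_data + (b_offset & PAGE_MASK). */
static char *
ex_buf_start(const struct ex_buf *bp)
{
	return (bp->b_data + (bp->b_offset & EX_PAGE_MASK));
}

/*
 * Bitmask with one bit per DEV_BSIZE chunk touched by the byte range
 * [off, off + len) inside a single page.
 */
static unsigned
ex_page_bits(unsigned off, unsigned len)
{
	assert(len > 0 && off + len <= EX_PAGE_SIZE);
	unsigned first = off / EX_DEV_BSIZE;
	unsigned last = (off + len - 1) / EX_DEV_BSIZE;

	return (((2u << last) - (1u << first)) & EX_PAGE_BITS_ALL);
}
```

With these assumed sizes, a buffer whose b_offset is 5120 starts 1024
bytes past b_data, and a full-page range yields all eight bits.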

VM buffers also keep track of a byte-granular dirty range and valid
range. This feature is normally only used by the NFS subsystem. I am
not sure why it is used at all, actually, since we have DEV_BSIZE
valid/dirty granularity within the VM buffer. If a buffer dirty
operation creates a 'hole', the dirty range will extend to cover the
hole. If a buffer validation operation creates a 'hole' the byte
granular valid range is left alone and will not take into account the
new extension. Thus the whole byte-granular abstraction is considered a
bad hack and it would be nice if we could get rid of it completely.
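
The way a single dirty range swallows a hole can be shown with a small
user-space model. The struct and function names here are hypothetical,
and the merge rule is a simplification of what the text describes, not
a copy of the kernel's code.

```c
#include <assert.h>

/* Hypothetical byte-granular dirty range [dirtyoff, dirtyend). */
struct ex_range {
	int dirtyoff;
	int dirtyend; /* range is empty when dirtyend <= dirtyoff */
};

/*
 * Mark [off, end) dirty. Only one range is tracked, so a write that
 * is not adjacent to the existing range widens it across the gap.
 */
static void
ex_mark_dirty(struct ex_range *r, int off, int end)
{
	assert(off >= 0 && off < end);
	if (r->dirtyend > r->dirtyoff) {
		if (off < r->dirtyoff)
			r->dirtyoff = off;
		if (end > r->dirtyend)
			r->dirtyend = end;
	} else {
		r->dirtyoff = off;
		r->dirtyend = end;
	}
}
```

Dirtying bytes 100-199 and then 1000-1099 leaves one range covering
100-1099, so the untouched bytes in between are treated as dirty too.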

A VM buffer is capable of mapping the underlying VM cache pages into KVM
in order to allow the kernel to directly manipulate the data associated
with the (vnode,b_offset,b_size). The kernel typically unmaps VM buffers
the moment they are no longer needed but often keeps the structure
instantiated and even bp->b_pages array instantiated despite having
unmapped them from KVM. If a page making up a VM buffer is about to
undergo I/O, the system typically unmaps it from KVM and replaces the
page in the b_pages[] array with a place-marker called bogus_page. The
place-marker forces any kernel subsystems referencing the associated
struct buf to re-lookup the associated page. I believe the place-marker
hack is used to allow sophisticated devices such as file system devices
to remap underlying pages in order to deal with, for example, re-mapping
a file fragment into a file block.
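
The place-marker idea can be illustrated with a toy model: a sentinel
entry in the page array tells the reader to look the page up again from
an authoritative table. Everything here (the ex_ types, the fixed array
size, the lookup table standing in for the VM object) is an assumption
made for illustration and does not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>

#define EX_NPAGES 4

struct ex_page {
	int id;
};

/* Sentinel standing in for the kernel's bogus_page place-marker. */
static struct ex_page ex_bogus_page;

/* Hypothetical authoritative page table (stands in for the VM object). */
struct ex_object {
	struct ex_page *pages[EX_NPAGES];
};

struct ex_buf {
	struct ex_object *obj;
	struct ex_page   *b_pages[EX_NPAGES];
};

/*
 * Return page i of the buffer. A place-marker entry forces a fresh
 * lookup, and the result replaces the marker in b_pages[].
 */
static struct ex_page *
ex_buf_page(struct ex_buf *bp, int i)
{
	assert(i >= 0 && i < EX_NPAGES);
	if (bp->b_pages[i] == &ex_bogus_page)
		bp->b_pages[i] = bp->obj->pages[i];
	return (bp->b_pages[i]);
}
```

A caller that always goes through ex_buf_page() never sees the marker
itself, which is the property the re-lookup is meant to provide.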

VM buffers are used to track I/O operations within the kernel.
Unfortunately, the I/O implementation is also somewhat of a hack because
the kernel wants to clear the dirty bit on the underlying pages the
moment it queues the I/O to the VFS device, not when the physical I/O is
actually initiated. This can create confusion within file system
devices that use delayed-writes because you wind up with pages marked
clean that are actually still dirty. If not treated carefully, these
pages could be thrown away! Indeed, a number of serious bugs related to
this hack were not fixed until the 2.2.8/3.0 release. The kernel uses
an instantiated VM buffer (i.e., struct buf) to place-mark pages in this
special state. The buffer is typically flagged B_DELWRI. When a device
no longer needs a buffer it typically flags it as B_RELBUF. Due to the
underlying pages being marked clean, the B_DELWRI|B_RELBUF combination
must be interpreted to mean that the buffer is still actually dirty and
must be written to its backing store before it can actually be released.
In the case where B_DELWRI is not set, the underlying dirty pages are
still properly marked as dirty and the buffer can be completely freed
without losing that clean/dirty state information. (XXX do we have to
check other flags in regards to this situation ???)
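
The flag interpretation above can be written down as a small decision
function. This is a sketch only: the flag values are placeholders
rather than the kernel's definitions, the enum and function names are
hypothetical, and, as the XXX note says, other flags may also matter.

```c
/* Placeholder bit values; NOT the kernel's B_DELWRI and B_RELBUF. */
#define EX_B_DELWRI 0x1u
#define EX_B_RELBUF 0x2u

enum ex_action {
	EX_KEEP,               /* buffer is not being released */
	EX_WRITE_THEN_RELEASE, /* still dirty: flush before freeing */
	EX_RELEASE             /* pages carry their own dirty state */
};

/* Decide what releasing a buffer with the given flags must involve. */
static enum ex_action
ex_release_action(unsigned flags)
{
	if ((flags & EX_B_RELBUF) == 0)
		return (EX_KEEP);
	if ((flags & EX_B_DELWRI) != 0)
		return (EX_WRITE_THEN_RELEASE);
	return (EX_RELEASE);
}
```

The middle case is the one that matters: with both flags set the pages
look clean, so the buffer is the only record that a write is pending.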

The kernel reserves a portion of its KVM space to hold VM Buffer's data
maps. Even though this is virtual space (since the buffers are mapped
from the buffer cache), we cannot make it arbitrarily large because
instantiated VM Buffers (struct buf's) prevent their underlying pages in
the buffer cache from being freed. This can complicate the life of the
paging system.

HISTORY

The buf manual page was originally written by Matthew Dillon and first
appeared in FreeBSD 3.1, December 1998.

BSD                            December 22, 1998                            BSD