geom(4)
NAME
GEOM - modular disk I/O request transformation framework.
DESCRIPTION
The GEOM framework provides an infrastructure in which "classes" can
perform transformations on disk I/O requests on their path from the
upper kernel to the device drivers and back.

Transformations in a GEOM context range from the simple geometric
displacement performed in typical disk partitioning modules over RAID
algorithms and device multipath resolution to full blown cryptographic
protection of the stored data.

Compared to traditional "volume management", GEOM differs from most
and in some cases all previous implementations in the following ways:

o   GEOM is extensible. It is trivially simple to write a new class
    of transformation and it will not be given stepchild treatment.
    If someone for some reason wanted to mount IBM MVS diskpacks, a
    class recognizing and configuring their VTOC information would be
    a trivial matter.

o   GEOM is topologically agnostic. Most volume management
    implementations have very strict notions of how classes can fit
    together; very often one fixed hierarchy is provided, for instance
    subdisk - plex - volume.

Being extensible means that new transformations are treated no
differently than existing transformations.

Fixed hierarchies are bad because they make it impossible to express
the intent efficiently. In the fixed hierarchy above it is not
possible to mirror two physical disks and then partition the mirror
into subdisks; instead one is forced to make subdisks on the physical
volumes and to mirror these two and two, resulting in a much more
complex configuration. GEOM on the other hand does not care in which
order things are done; the only restriction is that cycles in the
graph will not be allowed.

TERMINOLOGY and TOPOLOGY
GEOM is quite object oriented and consequently the terminology borrows
a lot of context and semantics from the OO vocabulary:

A "class", represented by the data structure "g_class", implements one
particular kind of transformation. Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.

An instance of a class is called a "geom" and represented by the data
structure "g_geom". In a typical i386 FreeBSD system, there will be
one geom of class MBR for each disk.

A "provider", represented by the data structure "g_provider", is the
front gate at which a geom offers service. A provider is "a disk-like
thing which appears in /dev" - a logical disk in other words. All
providers have three main properties: name, sectorsize and size.

A "consumer" is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.

The topological relationship between these entities is as follows:

o   A class has zero or more geom instances.
o   A geom has exactly one class it is derived from.
o   A geom has zero or more consumers.
o   A geom has zero or more providers.
o   A consumer can be attached to zero or one providers.
o   A provider can have zero or more consumers attached.

All geoms have a rank-number assigned, which is used to detect and
prevent loops in the acyclic directed graph. This rank number is
assigned as follows:

1.  A geom with no attached consumers has rank=1.

2.  A geom with attached consumers has a rank one higher than the
    highest rank of the geoms of the providers its consumers are
    attached to.

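The two rank rules can be tried out on a toy model. The sketch below
is plain Python, not kernel code: the Geom class, its "below" list and
the attach and depends_on helpers are names invented for this
illustration, and the cycle check is a simple reachability walk, which
is only an assumption about how loops might be refused. It builds the
mirror-then-partition arrangement mentioned in the DESCRIPTION section.

```python
class Geom:
    """Toy stand-in for a geom; not the kernel's g_geom structure."""

    def __init__(self, name):
        self.name = name
        # Geoms whose providers our consumers are attached to.
        self.below = []

    def depends_on(self, target):
        """True if target can be reached by walking downwards from here."""
        return any(g is target or g.depends_on(target) for g in self.below)

    def attach(self, other):
        """Attach one of our consumers to a provider of other."""
        if other is self or other.depends_on(self):
            raise ValueError("attach would create a cycle in the graph")
        self.below.append(other)

    def rank(self):
        # Rule 1: a geom with no attached consumers has rank 1.
        if not self.below:
            return 1
        # Rule 2: one higher than the highest rank found below.
        return 1 + max(g.rank() for g in self.below)


# Mirror two disks first, then partition the mirror.
disk0, disk1 = Geom("da0"), Geom("da1")
mirror = Geom("mirror0")
mirror.attach(disk0)
mirror.attach(disk1)
mbr = Geom("mbr-on-mirror0")
mbr.attach(mirror)

print(disk0.rank(), mirror.rank(), mbr.rank())
```

Mirroring first and partitioning afterwards is expressed directly; the
only attach that is refused is one that would close a loop, such as
attaching "da0" on top of the MBR geom.
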
SPECIAL TOPOLOGICAL MANEUVERS
In addition to the straightforward attach, which attaches a consumer
to a provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to improve
the overall flexibility.

TASTING is a process that happens whenever a new class or new provider
is created, and it provides the class a chance to automatically
configure an instance on providers which it recognizes as its own. A
typical example is the MBR disk-partition class which will look for
the MBR table in the first sector and, if found and validated, it will
instantiate a geom to multiplex according to the contents of the MBR.

A new class will be offered to all existing providers in turn and a
new provider will be offered to all classes in turn.

Exactly what a class does to recognize if it should accept the offered
provider is not defined by GEOM, but the sensible set of options are:

o   Examine specific data structures on the disk.
o   Examine properties like sectorsize or mediasize for the provider.
o   Examine the rank number of the provider's geom.
o   Examine the method name of the provider's geom.

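A taste routine built on the first two options might look roughly like
the sketch below. It is a toy in Python, not the MBR class itself:
mbr_taste and offer_provider are hypothetical names, the restriction
to 512-byte sectors is an assumption made to keep the example short,
and a real class would validate far more than the two-byte 0x55 0xAA
signature at the end of the first sector.

```python
SECTOR_SIZE = 512  # assumption: this toy only handles 512-byte sectors


def mbr_taste(first_sector, sectorsize):
    """Accept the provider if its first sector ends in 0x55 0xAA."""
    # Examine a property of the provider.
    if sectorsize != SECTOR_SIZE or len(first_sector) < SECTOR_SIZE:
        return False
    # Examine a specific data structure on the disk.
    return first_sector[510:512] == b"\x55\xaa"


def offer_provider(first_sector, sectorsize, classes):
    """Offer a new provider to every class in turn.

    Returns the names of the classes that would instantiate a geom.
    """
    return [name for name, taste in classes if taste(first_sector, sectorsize)]


classes = [("MBR", mbr_taste)]

blank = bytes(SECTOR_SIZE)
labelled = bytearray(SECTOR_SIZE)
labelled[510:512] = b"\x55\xaa"

print(offer_provider(blank, 512, classes))            # no class accepts
print(offer_provider(bytes(labelled), 512, classes))  # the MBR class accepts
```
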
ORPHANIZATION is the process by which a provider is removed while it
potentially is still being used.

When a geom orphans a provider, all future I/O requests will "bounce"
on the provider with an error code set by the geom. Any consumers
attached to the provider will receive notification about the
orphanization when the eventloop gets around to it, and they can take
appropriate action at that time.

A geom which came into being as a result of a normal taste operation
should selfdestruct unless it has a way to keep functioning lacking
the orphaned provider. Geoms like diskslicers should therefore
selfdestruct whereas RAID5 or mirror geoms will be able to continue,
as long as they do not lose quorum.

When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O
requests are still outstanding.

The typical scenario is:

o   A device driver detects a disk has departed and orphans the
    provider for it.

o   The geoms on top of the disk receive the orphanization event and
    orphan all their providers in turn. Providers which are not
    attached to will typically self-destruct right away. This process
    continues in a quasi-recursive fashion until all relevant pieces
    of the tree have heard the bad news.

o   Eventually the buck stops when it reaches geom_dev at the top of
    the stack.

o   Geom_dev will call destroy_dev(9) to stop any more requests from
    coming in. It will sleep until all (if any) outstanding I/O
    requests have been returned. It will explicitly close (i.e. zero
    the access counts), a change which will propagate all the way
    down through the mesh. It will then detach and destroy its geom.

o   The geom whose provider is now detached will destroy the
    provider, detach and destroy its consumer and destroy its geom.

o   This process percolates all the way down through the mesh, until
    the cleanup is complete.

While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.

The one absolutely crucial detail to be aware of is that if the device
driver does not return all I/O requests, the tree will not unravel.

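The first effects of orphaning a provider can be illustrated with a
small model. This is a sketch in Python under loud assumptions: the
Provider class, event_queue and run_events are invented for the
example, consumers are represented by plain strings, and ENXIO is
merely a plausible error code picked for illustration. It shows that
new requests bounce at once, while attachments and outstanding
requests are left alone until the queued notifications are processed.

```python
import errno
from collections import deque

event_queue = deque()  # stand-in for the GEOM event loop


class Provider:
    """Toy provider; not the kernel's g_provider structure."""

    def __init__(self, name):
        self.name = name
        self.error = 0         # non-zero once the provider is orphaned
        self.consumers = []    # attachments survive orphanization
        self.outstanding = 0   # requests accepted but not yet returned

    def start_io(self):
        """Accept a request, or bounce it if the provider is orphaned."""
        if self.error:
            return self.error
        self.outstanding += 1
        return 0

    def orphan(self, error):
        """Mark the provider orphaned and queue consumer notifications."""
        self.error = error
        for consumer in self.consumers:
            event_queue.append((consumer, self.name))


def run_events():
    """Deliver queued orphan notifications, oldest first."""
    delivered = []
    while event_queue:
        delivered.append(event_queue.popleft())
    return delivered


da0 = Provider("da0")
da0.consumers.append("mbr-consumer")
accepted = da0.start_io()       # accepted before the disk departs
da0.orphan(errno.ENXIO)
bounced = da0.start_io()        # bounces with the error set by the geom
notified = run_events()         # consumers hear about it later
```
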
SPOILING is a special case of orphanization used to protect against
stale metadata. It is probably easiest to understand spoiling by going
through an example.

Imagine a disk, "da0", on top of which an MBR geom provides "da0s1"
and "da0s2", and on top of "da0s1" a BSD geom provides "da0s1a"
through "da0s1e". Both the MBR and BSD geoms have autoconfigured based
on data structures on the disk media. Now imagine the case where "da0"
is opened for writing and those data structures are modified or
overwritten: now the geoms would be operating on stale metadata unless
some notification system can inform them otherwise.

To avoid this situation, when the open of "da0" for write happens, all
attached consumers are told about this, and geoms like MBR and BSD
will selfdestruct as a result. When "da0" is closed again, it will be
offered for tasting again and, if the data structures for MBR and BSD
are still there, new geoms will instantiate themselves anew.

Now for the fine print:

If any of the paths through the MBR or BSD module were open, they
would have opened downwards with an exclusive bit, rendering it
impossible to open "da0" for writing in that case. Conversely, the
requested exclusive bit would render it impossible to open a path
through the MBR geom while "da0" is open for writing.

From this it also follows that changing the size of open geoms can
only be done with their cooperation.

Finally: the spoiling only happens when the write count goes from zero
to non-zero and the retasting only when the write count goes from
non-zero to zero.

INSERT/DELETE is a very special operation which allows a new geom to
be instantiated between a consumer and a provider attached to each
other, and to remove it again.

To understand the utility of this, imagine a provider being mounted as
a file system. Between the DEVFS geom's consumer and its provider we
insert a mirror module which configures itself with one mirror copy
and consequently is transparent to the I/O requests on the path. We
can now configure yet a mirror copy on the mirror geom, request a
synchronization, and finally drop the first mirror copy. We have now
in essence moved a mounted file system from one disk to another while
it was being used. At this point the mirror geom can be deleted from
the path again; it has served its purpose.

CONFIGURE is the process where the administrator issues instructions
for a particular class to instantiate itself. There are multiple ways
to express intent in this case; a particular provider can be specified
with a level of override, forcing for instance a BSD disklabel module
to attach to a provider which was not found palatable during the TASTE
operation.

Finally, IO is the reason we even do this: it concerns itself with
sending I/O requests through the graph.

I/O REQUESTS, represented by struct bio, originate at a consumer, are
scheduled on its attached provider, and when processed, are returned
to the consumer. It is important to realize that the struct bio which
enters through the provider of a particular geom does not "come out on
the other side". Even simple transformations like MBR and BSD will
clone the struct bio, modify the clone, and schedule the clone on
their own consumer. Note that cloning the struct bio does not involve
cloning the actual data area specified in the IO request.

In total four different IO requests exist in GEOM: read, write,
delete, and get attribute.

Read and write are self explanatory.

Delete indicates that a certain range of data is no longer used and
that it can be erased or freed as the underlying technology supports.
Technologies like flash adaptation layers can arrange to erase the
relevant blocks before they will become reassigned and cryptographic
devices may want to fill random bits into the range to reduce the
amount of data available for attack.

It is important to recognize that a delete indication is not a request
and consequently there is no guarantee that the data actually will be
erased or made unavailable unless guaranteed by specific geoms in the
graph. If "secure delete" semantics are required, a geom should be
pushed which converts delete indications into (a sequence of) write
requests.

Get attribute supports inspection and manipulation of out-of-band
attributes on a particular provider or path. Attributes are named by
ascii strings and they will be discussed in a separate section below.

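The clone-and-modify behaviour described for struct bio earlier in
this section can be pictured with a toy model. The sketch below is
Python, not kernel code: the Bio class, its field names, clone_bio and
slice_start are illustrative assumptions, and the slice offset of 63
sectors is a made-up example value. The point it demonstrates is that
the clone carries its own offset while sharing the parent's data area.

```python
class Bio:
    """Toy stand-in for struct bio; field names are illustrative."""

    def __init__(self, cmd, offset, length, data, parent=None):
        self.cmd = cmd
        self.offset = offset
        self.length = length
        self.data = data        # shared buffer, never copied by cloning
        self.parent = parent    # request this clone was made from


def clone_bio(bio):
    """Copy the request header only; the data area is shared."""
    return Bio(bio.cmd, bio.offset, bio.length, bio.data, parent=bio)


def slice_start(bio, slice_offset):
    """What a simple slicing geom might do with an incoming request.

    The original request is left untouched; the clone is displaced by
    the slice's starting offset and would then be scheduled on the
    geom's own consumer.
    """
    child = clone_bio(bio)
    child.offset += slice_offset
    return child


SLICE_OFFSET = 63 * 512  # hypothetical start of a slice, in bytes

request = Bio("read", offset=4096, length=512, data=bytearray(512))
below = slice_start(request, SLICE_OFFSET)

print(request.offset, below.offset, below.data is request.data)
```
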
(stay tuned while the author rests his brain and fingers: more to
come.)

DIAGNOSTICS
Several flags are provided for tracing GEOM operations and unlocking
protection mechanisms via the kern.geom.debugflags sysctl. All of
these flags are off by default, and great care should be taken in
turning them on.

0x01 (G_T_TOPOLOGY)
     Provide tracing of topology change events.

0x02 (G_T_BIO)
     Provide tracing of buffer I/O requests.

0x04 (G_T_ACCESS)
     Provide tracing of access check controls.

0x08 (unused)

0x10 (allow foot shooting)
     Allow writing to Rank 1 providers. This would, for example,
     allow the super-user to overwrite the MBR on the root disk or
     write random sectors elsewhere to a mounted disk. The
     implications are obvious.

0x20 (G_T_DETAILS)
     This appears to be unused at this time.

0x40 (G_F_DISKIOCTL)
     This appears to be unused at this time.

0x80 (G_F_CTLDUMP)
     Dump contents of gctl requests.

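As a reading aid, a kern.geom.debugflags value can be decoded into the
names listed above. This is a small sketch in Python, not a GEOM tool:
decode_debugflags is a hypothetical helper and the table is copied
from this section, so it is only as current as this page.

```python
# Bit values and names as listed in the DIAGNOSTICS section.
DEBUG_FLAGS = [
    (0x01, "G_T_TOPOLOGY"),
    (0x02, "G_T_BIO"),
    (0x04, "G_T_ACCESS"),
    (0x08, "unused"),
    (0x10, "allow foot shooting"),
    (0x20, "G_T_DETAILS"),
    (0x40, "G_F_DISKIOCTL"),
    (0x80, "G_F_CTLDUMP"),
]


def decode_debugflags(value):
    """Return the names of all flags set in value, lowest bit first."""
    return [name for bit, name in DEBUG_FLAGS if value & bit]


# Topology tracing plus permission to write to Rank 1 providers.
print(decode_debugflags(0x11))
```

A value of 0 decodes to an empty list, which matches the statement
that all flags are off by default.
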
HISTORY
This software was developed for the FreeBSD Project by Poul-Henning
Kamp and NAI Labs, the Security Research Division of Network
Associates, Inc. under DARPA/SPAWAR contract N66001-01-C-8035
("CBOSS"), as part of the DARPA CHATS research program.

The first precursor for GEOM was a gruesome hack to Minix 1.2 and was
never distributed. An earlier attempt to implement a less general
scheme in FreeBSD never succeeded.

AUTHORS
Poul-Henning Kamp <phk@FreeBSD.org>

BSD                          March 27, 2002