geom(4)
NAME
GEOM - modular disk I/O request transformation framework.
DESCRIPTION
- The GEOM framework provides an infrastructure in which
- "classes" can perform transformations on disk I/O requests on their path from
- the upper kernel to the device drivers and back.
- Transformations in a GEOM context range from the simple geo
- metric displacement performed in typical disk partitioning modules
- over RAID algorithms and device multipath resolution to full blown crypto
- graphic protection of the stored data.
- Compared to traditional "volume management", GEOM differs
- from most and in some cases all previous implementations in the following
- ways:
- +o GEOM is extensible. It is trivially simple to write a
- new class of
- transformation and it will not be given stepchild treat
- ment. If someone for some reason wanted to mount IBM MVS
- diskpacks, a class recognizing and configuring their VTOC information would
- be a trivial matter.
- +o GEOM is topologically agnostic. Most volume management
- implementa
- tions have very strict notions of how classes can fit
- together, very often one fixed hierarchy is provided for instance sub
- disk - plex volume.
- Being extensible means that new transformations are treated
- no differently than existing transformations.
- Fixed hierarchies are bad because they make it impossible to
- express the intent efficiently. In the fixed hierarchy above it is not
- possible to mirror two physical disks and then partition the mirror into
- subdisks, instead one is forced to make subdisks on the physical vol
- umes and to mirror these two and two resulting in a much more complex
- configuration. GEOM on the other hand does not care in which order things
- are done, the only restriction is that cycles in the graph will not be al
- lowed.
TERMINOLOGY and TOPOLOGY
- GEOM is quite object oriented and consequently the terminol
- ogy borrows a lot of context and semantics from the OO vocabulary:
- A "class", represented by the data structure g_class imple
- ments one particular kind of transformation. Typical examples are MBR
- disk partition, BSD disklabel, and RAID5 classes.
- An instance of a class is called a "geom" and represented by
- the data structure "g_geom". In a typical i386 FreeBSD system, there
- will be one geom of class MBR for each disk.
- A "provider", represented by the data structure
- "g_provider", is the front gate at which a geom offers service. A provider is "a
- disk-like thing which appears in /dev" - a logical disk in other
- words. All providers have three main properties: name, sectorsize and
- size.
- A "consumer" is the backdoor through which a geom connects
- to another geom provider and through which I/O requests are sent.
- The topological relationship between these entities are as
- follows:
- +o A class has zero or more geom instances.
- +o A geom has exactly one class it is derived from.
- +o A geom has zero or more consumers.
- +o A geom has zero or more providers.
- +o A consumer can be attached to zero or one providers.
- +o A provider can have zero or more consumers attached.
- All geoms have a rank-number assigned, which is used to de
- tect and prevent loops in the acyclic directed graph. This rank number
- is assigned as follows:
- 1. A geom with no attached consumers has rank=1
- 2. A geom with attached consumers has a rank one higher
- than the high est rank of the geoms of the providers its consumers
- are attached to.
SPECIAL TOPOLOGICAL MANEUVERS
- In addition to the straightforward attach, which attaches a
- consumer to a provider, and detach, which breaks the bond, a number of
- special topological maneuvers exists to facilitate configuration and to im
- prove the overall flexibility.
- TASTING is a process that happens whenever a new class or
- new provider is created and it provides the class a chance to automatically
- configure an instance on providers, which it recognize as its own. A
- typical example is the MBR disk-partition class which will look for the MBR
- table in the first sector and if found and validated it will instantiate
- a geom to multiplex according to the contents of the MBR.
- A new class will be offered to all existing providers in
- turn and a new provider will be offered to all classes in turn.
- Exactly what a class does to recognize if it should accept
- the offered provider is not defined by GEOM, but the sensible set of op
- tions are:
- +o Examine specific data structures on the disk.
- +o Examine properties like sectorsize or mediasize for the
- provider.
- +o Examine the rank number of the provider's geom.
- +o Examine the method name of the provider's geom.
- ORPHANIZATION is the process by which a provider is removed
- while it potentially is still being used.
- When a geom orphans a provider, all future I/O requests will
- "bounce" on the provider with an error code set by the geom. Any con
- sumers attached to the provider will receive notification about the orpha
- nization when the eventloop gets around to it, and they can take appropri
- ate action at that time.
- A geom which came into being as a result of a normal taste
- operation should selfdestruct unless it has a way to keep functioning
- lacking the orphaned provider. Geoms like diskslicers should therefore
- selfdestruct whereas RAID5 or mirror geoms will be able to continue, as
- long as they do not loose quorum.
- When a provider is orphaned, this does not necessarily re
- sult in any immediate change in the topology: any attached consumers are
- still attached, any opened paths are still open, any outstanding
- I/O requests are still outstanding.
- The typical scenario is +o A device driver detects a disk has departed and
- orphans the provider for it.
- +o The geoms on top of the disk receive the orpha
- nization event and orphans all their providers in turn.
- Providers, which are not attached to, will typically self-destruct
- right away. This process continues in a quasi-recursive fashion un
- til all relevant pieces of the tree has heard the bad news.
- +o Eventually the buck stops when it reaches geom_dev
- at the top of the stack.
- +o Geom_dev will call destroy_dev(9) to stop any more
- request fromcoming in. It will sleep until all (if any) out
- standing I/O requests have been returned. It will explicitly
- close (ie: zero the access counts), a change which will prop
- agate all the way down through the mesh. It will then detach
- and destroy its geom.
- +o The geom whose provider is now attached will de
- stroy theprovider, detach and destroy its consumer and de
- stroy its geom.
- +o This process percolates all the way down through
- the mesh, until the cleanup is complete.
- While this approach seems byzantine, it does provide the
- maximum flexibility and robustness in handling disappearing devices.
- The one absolutely crucial detail to be aware is that if the
- device driver does not return all I/O requests, the tree will not
- unravel.
- SPOILING is a special case of orphanization used to protect
- against stale metadata. It is probably easiest to understand spoiling by
- going through an example.
- Imagine a disk, "da0" on top of which a MBR geom provides
- "da0s1" and "da0s2" and on top of "da0s1" a BSD geom provides "da0s1a"
- through "da0s1e", both the MBR and BSD geoms have autoconfigured
- based on data structures on the disk media. Now imagine the case where
- "da0" is opened for writing and those data structures are modified or over
- written: Now the geoms would be operating on stale metadata unless some
- notification system can inform them otherwise.
- To avoid this situation, when the open of "da0" for write
- happens, all attached consumers are told about this, and geoms like MBR
- and BSD will selfdestruct as a result. When "da0" is closed again, it
- will be offered for tasting again and if the data structures for MBR and BSD
- are still there, new geoms will instantiate themselves anew.
- Now for the fine print:
- If any of the paths through the MBR or BSD module were open,
- they would have opened downwards with an exclusive bit rendering it im
- possible to open "da0" for writing in that case and conversely the re
- quested exclusive bit would render it impossible to open a path through
- the MBR geom while "da0" is open for writing.
- From this it also follows that changing the size of open ge
- oms can only be done with their cooperation.
- Finally: the spoiling only happens when the write count goes
- from zero to non-zero and the retasting only when the write count goes
- from non-zero to zero.
- INSERT/DELETE are a very special operation which allows a
- new geom to be instantiated between a consumer and a provider attached to
- each other and to remove it again.
- To understand the utility of this, imagine a provider with
- being mounted as a file system. Between the DEVFS geoms consumer and its
- provider we insert a mirror module which configures itself with one mir
- ror copy and consequently is transparent to the I/O requests on the path.
- We can now configure yet a mirror copy on the mirror geom, request a
- synchronization, and finally drop the first mirror copy. We have now
- in essence moved a mounted file system from one disk to another while
- it was being used. At this point the mirror geom can be deleted from the
- path again, it has served its purpose.
- CONFIGURE is the process where the administrator issues in
- structions for a particular class to instantiate itself. There are multi
- ple ways to express intent in this case, a particular provider can be
- specified with a level of override forcing for instance a BSD disklabel
- module to attach to a provider which was not found palatable during the TASTE
- operation.
- Finally IO is the reason we even do this: it concerns itself
- with sending I/O requests through the graph.
- I/O REQUESTS represented by struct bio, originate at a con
- sumer, are scheduled on its attached provider, and when processed, re
- turned to the consumer. It is important to realize that the struct bio
- which enters through the provider of a particular geom does not "come out
- on the other side". Even simple transformations like MBR and BSD will
- clone the struct bio, modify the clone, and schedule the clone on
- their own consumer. Note that cloning the struct bio does not involve
- cloning the actual data area specified in the IO request.
- In total four different IO requests exist in GEOM: read,
- write, delete, and get attribute.
- Read and write are self explanatory.
- Delete indicates that a certain range of data is no longer
- used and that it can be erased or freed as the underlying technology sup
- ports. Technologies like flash adaptation layers can arrange to erase
- the relevant blocks before they will become reassigned and cryptographic
- devices may want to fill random bits into the range to reduce the amount
- of data available for attack.
- It is important to recognize that a delete indication is not
- a request and consequently there is no guarantee that the data actual
- ly will be erased or made unavailable unless guaranteed by specific ge
- oms in the graph. If "secure delete" semantics are required, a geom
- should be pushed which converts delete indications into (a sequence
- of) write requests.
- Get attribute supports inspection and manipulation of out
- of-band attributes on a particular provider or path. Attributes are
- named by ascii strings and they will be discussed in a separate sec
- tion below.
- (stay tuned while the author rests his brain and fingers:
- more to come.)
DIAGNOSTICS
- Several flags are provided for tracing GEOM operations and
- unlocking protection mechanisms via the kern.geom.debugflags sysctl. All
- of these flags are off by default, and great care should be taken in
- turning them on.
- 0x01 (G_T_TOPOLOGY) Provide tracing of topology change events.
- 0x02 (G_T_BIO) Provide tracing of buffer I/O requests.
- 0x04 (G_T_ACCESS) Provide tracing of access check controls.
- 0x08 (unused)
- 0x10 (allow foot shooting) Allow writing to Rank 1 providers. This would, for
- example, allow the super-user to overwrite the MBR on the root disk
- or write random sectors elsewhere to a mounted disk. The implica
- tions are obvious.
- 0x20 (G_T_DETAILS) This appears to be unused at this time.
- 0x40 (G_F_DISKIOCTL) This appears to be unused at this time.
- 0x80 (G_F_CTLDUMP) Dump contents of gctl requests.
HISTORY
- This software was developed for the FreeBSD Project by Poul
- Henning Kamp and NAI Labs, the Security Research Division of Network As
- sociates, Inc. under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as
- part of the DARPA CHATS research program.
- The first precursor for GEOM was a gruesome hack to Minix
- 1.2 and was never distributed. An earlier attempt to implement a less
- general scheme in FreeBSD never succeeded.
AUTHORS
- Poul-Henning Kamp <phk@FreeBSD.org>
- BSD March 27, 2002