geom(4)

NAME

GEOM - modular disk I/O request transformation framework.

DESCRIPTION

The GEOM framework provides an infrastructure in which "classes" can
perform transformations on disk I/O requests on their path from the upper
kernel to the device drivers and back.

Transformations in a GEOM context range from the simple geometric
displacement performed in typical disk partitioning modules over RAID
algorithms and device multipath resolution to full blown cryptographic
protection of the stored data.

Compared to traditional "volume management", GEOM differs from most and
in some cases all previous implementations in the following ways:
+o GEOM is extensible. It is trivially simple to write a new class of
   transformation and it will not be given stepchild treatment. If
   someone for some reason wanted to mount IBM MVS diskpacks, a class
   recognizing and configuring their VTOC information would be a trivial
   matter.
+o GEOM is topologically agnostic. Most volume management implementations
   have very strict notions of how classes can fit together, very often
   one fixed hierarchy is provided, for instance subdisk - plex - volume.

Being extensible means that new transformations are treated no
differently than existing transformations.

Fixed hierarchies are bad because they make it impossible to express the
intent efficiently. In the fixed hierarchy above it is not possible to
mirror two physical disks and then partition the mirror into subdisks,
instead one is forced to make subdisks on the physical volumes and to
mirror these two and two resulting in a much more complex configuration.
GEOM on the other hand does not care in which order things are done, the
only restriction is that cycles in the graph will not be allowed.

TERMINOLOGY AND TOPOLOGY

GEOM is quite object oriented and consequently the terminology borrows a
lot of context and semantics from the OO vocabulary:

A "class", represented by the data structure g_class, implements one
particular kind of transformation. Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.

An instance of a class is called a "geom" and represented by the data
structure "g_geom". In a typical i386 FreeBSD system, there will be one
geom of class MBR for each disk.

A "provider", represented by the data structure "g_provider", is the
front gate at which a geom offers service. A provider is "a disk-like
thing which appears in /dev" - a logical disk in other words. All
providers have three main properties: name, sectorsize and size.

A "consumer" is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.

The topological relationship between these entities is as follows:
+o A class has zero or more geom instances.
+o A geom has exactly one class it is derived from.
+o A geom has zero or more consumers.
+o A geom has zero or more providers.
+o A consumer can be attached to zero or one providers.
+o A provider can have zero or more consumers attached.

All geoms have a rank-number assigned, which is used to detect and
prevent loops in the acyclic directed graph. This rank number is assigned
as follows:
1. A geom with no attached consumers has rank=1.
2. A geom with attached consumers has a rank one higher than the highest
   rank of the geoms of the providers its consumers are attached to.
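
The two rank rules can be sketched in a few lines of Python. This is a toy
model only: the Geom class, its "below" list and the example names are
assumptions made for illustration and do not mirror the kernel's data
structures.

```python
class Geom:
    """Toy stand-in for a geom (not the kernel's struct g_geom)."""

    def __init__(self, name, below=()):
        self.name = name
        # Geoms owning the providers that this geom's consumers attach to.
        self.below = list(below)

    def rank(self):
        # Rule 1: a geom with no attached consumers has rank 1.
        if not self.below:
            return 1
        # Rule 2: one higher than the highest rank among the geoms below.
        return 1 + max(g.rank() for g in self.below)


# Hypothetical topology: an MBR geom on da0, and a mirror whose consumers
# attach to a provider of the MBR geom and to the raw disk da1.
da0 = Geom("da0")
da1 = Geom("da1")
mbr = Geom("MBR on da0", below=[da0])
mirror = Geom("mirror", below=[mbr, da1])
ranks = {g.name: g.rank() for g in (da0, da1, mbr, mirror)}
```

Because a geom always ranks strictly higher than everything it is attached
to, a path of attachments can never lead back to the geom it started from.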

SPECIAL TOPOLOGICAL MANEUVERS

In addition to the straightforward attach, which attaches a consumer to a
provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to improve
the overall flexibility.

TASTING is a process that happens whenever a new class or new provider is
created and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own. A typical example
is the MBR disk-partition class which will look for the MBR table in the
first sector and if found and validated it will instantiate a geom to
multiplex according to the contents of the MBR.

A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn.

Exactly what a class does to recognize if it should accept the offered
provider is not defined by GEOM, but the sensible set of options is:
+o Examine specific data structures on the disk.
+o Examine properties like sectorsize or mediasize for the provider.
+o Examine the rank number of the provider's geom.
+o Examine the method name of the provider's geom.
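
As an illustration of the first option, the following Python sketch offers
the first sector of a new provider to each class in turn. It assumes a
512-byte sector and checks only the 0x55 0xAA signature in the last two
bytes; a real MBR class would validate more than that. The function names
and the CLASSES list are hypothetical and are not part of GEOM.

```python
SECTOR_SIZE = 512  # assumption for this sketch


def taste_mbr(first_sector):
    """Accept if the sector ends with the 0x55 0xAA boot signature."""
    return (len(first_sector) >= SECTOR_SIZE
            and first_sector[510:512] == b"\x55\xaa")


def taste_never(first_sector):
    """A class that never recognizes anything as its own."""
    return False


# Hypothetical registry of (class name, taste function) pairs.
CLASSES = [("MBR", taste_mbr), ("NEVER", taste_never)]


def offer(first_sector, classes=CLASSES):
    """Offer one provider to every class in turn; list those that accept."""
    return [name for name, taste in classes if taste(first_sector)]


blank = bytes(SECTOR_SIZE)             # all zeroes, no signature
with_mbr = bytes(510) + b"\x55\xaa"    # signature in bytes 510 and 511
```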

ORPHANIZATION is the process by which a provider is removed while it
potentially is still being used.

When a geom orphans a provider, all future I/O requests will "bounce" on
the provider with an error code set by the geom. Any consumers attached
to the provider will receive notification about the orphanization when
the event loop gets around to it, and they can take appropriate action at
that time.

A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning lacking the
orphaned provider. Geoms like disk slicers should therefore self-destruct
whereas RAID5 or mirror geoms will be able to continue, as long as they
do not lose quorum.

When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O requests
are still outstanding.

The typical scenario is:
+o A device driver detects a disk has departed and orphans the provider
   for it.
+o The geoms on top of the disk receive the orphanization event and
   orphan all their providers in turn. Providers which are not attached
   to will typically self-destruct right away. This process continues in
   a quasi-recursive fashion until all relevant pieces of the tree have
   heard the bad news.
+o Eventually the buck stops when it reaches geom_dev at the top of the
   stack.
+o Geom_dev will call destroy_dev(9) to stop any more requests from
   coming in. It will sleep until all (if any) outstanding I/O requests
   have been returned. It will explicitly close (i.e. zero the access
   counts), a change which will propagate all the way down through the
   mesh. It will then detach and destroy its geom.
+o The geom whose provider is now detached will destroy the provider,
   detach and destroy its consumer and destroy its geom.
+o This process percolates all the way down through the mesh, until the
   cleanup is complete.

While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.

The one absolutely crucial detail to be aware of is that if the device
driver does not return all I/O requests, the tree will not unravel.
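
The two effects of orphaning a provider, immediate bouncing of new I/O and
deferred notification of consumers, can be modelled roughly as follows.
This is a Python sketch under stated assumptions: the Provider class, the
event queue and the choice of ENXIO as the error code are all invented for
the example and are not taken from the GEOM sources.

```python
import errno
from collections import deque


class Provider:
    """Toy provider: an error slot plus callbacks standing in for consumers."""

    def __init__(self, name):
        self.name = name
        self.error = 0        # non-zero once the provider has been orphaned
        self.consumers = []   # callables invoked with the provider


event_queue = deque()         # stand-in for the GEOM event loop


def orphan(provider, error):
    """Mark the provider orphaned and queue notifications for later."""
    provider.error = error
    for notify in provider.consumers:
        event_queue.append((notify, provider))


def start_io(provider):
    """Return 0 if the request is accepted, else the error it bounces with."""
    return provider.error


def run_events():
    """Deliver queued orphan notifications, as the event loop eventually would."""
    while event_queue:
        notify, provider = event_queue.popleft()
        notify(provider)


heard = []
da0 = Provider("da0")
da0.consumers.append(lambda p: heard.append(p.name))
```

After orphan(da0, errno.ENXIO), start_io(da0) bounces at once, while the
consumer callback only runs when run_events() is called.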

SPOILING is a special case of orphanization used to protect against stale
metadata. It is probably easiest to understand spoiling by going through
an example.

Imagine a disk, "da0", on top of which an MBR geom provides "da0s1" and
"da0s2", and on top of "da0s1" a BSD geom provides "da0s1a" through
"da0s1e". Both the MBR and BSD geoms have autoconfigured based on data
structures on the disk media. Now imagine the case where "da0" is opened
for writing and those data structures are modified or overwritten: now
the geoms would be operating on stale metadata unless some notification
system can inform them otherwise.

To avoid this situation, when the open of "da0" for write happens, all
attached consumers are told about this, and geoms like MBR and BSD will
self-destruct as a result. When "da0" is closed again, it will be offered
for tasting again and if the data structures for MBR and BSD are still
there, new geoms will instantiate themselves anew.

Now for the fine print:

If any of the paths through the MBR or BSD module were open, they would
have opened downwards with an exclusive bit rendering it impossible to
open "da0" for writing in that case and conversely the requested
exclusive bit would render it impossible to open a path through the MBR
geom while "da0" is open for writing.

From this it also follows that changing the size of open geoms can only
be done with their cooperation.

Finally: the spoiling only happens when the write count goes from zero to
non-zero and the retasting only when the write count goes from non-zero
to zero.
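
The last rule is an edge-triggered counter, which the following Python
sketch makes explicit. The SpoilTracker class and its method names are
hypothetical; only the zero/non-zero transition logic is taken from the
text above.

```python
class SpoilTracker:
    """Toy write-count tracker for a single provider."""

    def __init__(self):
        self.write_count = 0
        self.events = []      # record of "spoil" and "retaste" triggers

    def open_for_write(self):
        # Spoil only on the transition from zero to non-zero.
        if self.write_count == 0:
            self.events.append("spoil")
        self.write_count += 1

    def close_write(self):
        if self.write_count == 0:
            raise ValueError("provider is not open for writing")
        self.write_count -= 1
        # Retaste only on the transition from non-zero to zero.
        if self.write_count == 0:
            self.events.append("retaste")


tracker = SpoilTracker()
tracker.open_for_write()   # 0 -> 1: spoil
tracker.open_for_write()   # 1 -> 2: nothing
tracker.close_write()      # 2 -> 1: nothing
tracker.close_write()      # 1 -> 0: retaste
```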

INSERT/DELETE are very special operations which allow a new geom to be
instantiated between a consumer and a provider attached to each other and
to remove it again.

To understand the utility of this, imagine a provider being mounted as a
file system. Between the DEVFS geom's consumer and its provider we insert
a mirror module which configures itself with one mirror copy and
consequently is transparent to the I/O requests on the path. We can now
configure yet another mirror copy on the mirror geom, request a
synchronization, and finally drop the first mirror copy. We have now in
essence moved a mounted file system from one disk to another while it was
being used. At this point the mirror geom can be deleted from the path
again, it has served its purpose.

CONFIGURE is the process where the administrator issues instructions for
a particular class to instantiate itself. There are multiple ways to
express intent in this case, a particular provider can be specified with
a level of override forcing for instance a BSD disklabel module to attach
to a provider which was not found palatable during the TASTE operation.

Finally IO is the reason we even do this: it concerns itself with sending
I/O requests through the graph.

I/O REQUESTS, represented by struct bio, originate at a consumer, are
scheduled on its attached provider, and when processed, returned to the
consumer. It is important to realize that the struct bio which enters
through the provider of a particular geom does not "come out on the other
side". Even simple transformations like MBR and BSD will clone the
struct bio, modify the clone, and schedule the clone on their own
consumer. Note that cloning the struct bio does not involve cloning the
actual data area specified in the IO request.
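
The clone-and-modify pattern can be illustrated with a toy request object
in Python. The Bio dataclass below is a hypothetical simplification and is
not the kernel's struct bio; the point is only that the clone carries a
shifted offset while sharing the same data buffer as the original.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Bio:
    """Toy I/O request; the field names are assumptions for this sketch."""
    cmd: str
    offset: int
    length: int
    data: bytearray
    parent: Optional["Bio"] = None


def clone_for_slice(bio, slice_start):
    """Clone a request entering a slice and shift it to the disk offset.

    dataclasses.replace() builds a new Bio from the same field values, so
    the data buffer is shared with the original rather than copied.
    """
    return replace(bio, offset=bio.offset + slice_start, parent=bio)


buf = bytearray(512)
top = Bio("read", offset=1024, length=512, data=buf)
# Hypothetical slice starting at sector 63 of the underlying disk.
child = clone_for_slice(top, 63 * 512)
```

The original request is left untouched, which is what allows it to be
returned to its own consumer once the clone completes.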

In total four different IO requests exist in GEOM: read, write, delete,
and get attribute.

Read and write are self explanatory.

Delete indicates that a certain range of data is no longer used and that
it can be erased or freed as the underlying technology supports.
Technologies like flash adaptation layers can arrange to erase the
relevant blocks before they will become reassigned and cryptographic
devices may want to fill random bits into the range to reduce the amount
of data available for attack.

It is important to recognize that a delete indication is not a request
and consequently there is no guarantee that the data actually will be
erased or made unavailable unless guaranteed by specific geoms in the
graph. If "secure delete" semantics are required, a geom should be
pushed which converts delete indications into (a sequence of) write
requests.
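
A geom providing "secure delete" semantics would have to turn each delete
indication into explicit writes. The following Python sketch shows one way
to split a deleted range into a sequence of writes. The function name, the
4096-byte chunk size and the choice of zero fill are assumptions made for
the example; the text above does not prescribe any of them.

```python
def delete_to_writes(offset, length, chunk=4096):
    """Turn one delete indication into a list of (offset, data) writes."""
    writes = []
    end = offset + length
    while offset < end:
        n = min(chunk, end - offset)
        # Zero fill is an assumption; random data would work the same way.
        writes.append((offset, bytes(n)))
        offset += n
    return writes


# A 10000-byte delete starting at offset 8192 becomes three writes.
writes = delete_to_writes(8192, 10000)
shape = [(off, len(data)) for off, data in writes]
```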

Get attribute supports inspection and manipulation of out-of-band
attributes on a particular provider or path. Attributes are named by
ASCII strings and they will be discussed in a separate section below.

(stay tuned while the author rests his brain and fingers: more to come.)

DIAGNOSTICS

Several flags are provided for tracing GEOM operations and unlocking
protection mechanisms via the kern.geom.debugflags sysctl. All of these
flags are off by default, and great care should be taken in turning them
on.

0x01 (G_T_TOPOLOGY)
        Provide tracing of topology change events.
0x02 (G_T_BIO)
        Provide tracing of buffer I/O requests.
0x04 (G_T_ACCESS)
        Provide tracing of access check controls.
0x08 (unused)
0x10 (allow foot shooting)
        Allow writing to Rank 1 providers. This would, for example, allow
        the super-user to overwrite the MBR on the root disk or write
        random sectors elsewhere to a mounted disk. The implications are
        obvious.
0x20 (G_T_DETAILS)
        This appears to be unused at this time.
0x40 (G_F_DISKIOCTL)
        This appears to be unused at this time.
0x80 (G_F_CTLDUMP)
        Dump contents of gctl requests.
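
Since the flags are individual bits, several can be enabled at once by
OR-ing their values together. The Python sketch below combines and decodes
such values using the table above. The helper names are hypothetical, and
the 0x10 bit is keyed by its description because this page gives no
symbolic name for it.

```python
# Bit values and names as listed in the table above.
DEBUG_FLAGS = {
    0x01: "G_T_TOPOLOGY",
    0x02: "G_T_BIO",
    0x04: "G_T_ACCESS",
    0x10: "allow foot shooting",
    0x20: "G_T_DETAILS",
    0x40: "G_F_DISKIOCTL",
    0x80: "G_F_CTLDUMP",
}


def combine(*names):
    """OR together the bits for the given flag names."""
    by_name = {name: bit for bit, name in DEBUG_FLAGS.items()}
    value = 0
    for name in names:
        value |= by_name[name]
    return value


def decode(value):
    """List the names of all flags set in value, lowest bit first."""
    return [name for bit, name in sorted(DEBUG_FLAGS.items()) if value & bit]


tracing = combine("G_T_TOPOLOGY", "G_T_BIO")
```

The resulting integer is the kind of value that would be assigned to
kern.geom.debugflags.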

HISTORY

This software was developed for the FreeBSD Project by Poul-Henning Kamp
and NAI Labs, the Security Research Division of Network Associates, Inc.
under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
DARPA CHATS research program.

The first precursor for GEOM was a gruesome hack to Minix 1.2 and was
never distributed. An earlier attempt to implement a less general scheme
in FreeBSD never succeeded.

AUTHORS

Poul-Henning Kamp <phk@FreeBSD.org>

BSD March 27, 2002