regexp(3)
NAME
regcomp, regexec, regsub, regerror - regular expression handlers
LIBRARY
Compatibility Library (libcompat, -lcompat)
SYNOPSIS
#include <regexp.h>

regexp *regcomp(const char *exp);
int regexec(const regexp *prog, const char *string);
void regsub(const regexp *prog, const char *source, char *dest);
DESCRIPTION
This interface is made obsolete by regex(3).

The regcomp(), regexec(), regsub(), and regerror() functions implement
egrep(1)-style regular expressions and supporting facilities.

The regcomp() function compiles a regular expression into a structure
of type regexp, and returns a pointer to it.  The space is allocated
using malloc(3) and may be released by free(3).

The regexec() function matches a NUL-terminated string against the
compiled regular expression in prog.  It returns 1 for success and 0
for failure, and adjusts the contents of prog's startp and endp (see
below) accordingly.

The members of a regexp structure include at least the following (not
necessarily in order):

      char *startp[NSUBEXP];
      char *endp[NSUBEXP];

where NSUBEXP is defined (as 10) in the header file.  Once a successful
regexec() has been done using the regexp, each startp-endp pair
describes one substring within the string, with the startp pointing to
the first character of the substring and the endp pointing to the first
character following the substring.  The 0th substring is the substring
of string that matched the whole regular expression.  The others are
those substrings that matched parenthesized expressions within the
regular expression, with parenthesized expressions numbered in
left-to-right order of their opening parentheses.

The regsub() function copies source to dest, making substitutions
according to the most recent regexec() performed using prog.  Each
instance of `&' in source is replaced by the substring indicated by
startp[0] and endp[0].  Each instance of `\n', where n is a digit, is
replaced by the substring indicated by startp[n] and endp[n].  To get a
literal `&' or `\n' into dest, prefix it with `\'; to get a literal `\'
preceding `&' or `\n', prefix it with another `\'.

The regerror() function is called whenever an error is detected in
regcomp(), regexec(), or regsub().  The default regerror() writes the
string msg, with a suitable indicator of origin, on the standard error
output and invokes exit(3).  The regerror() function can be replaced by
the user if other actions are desirable.
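
The substitution rules that regsub() applies to source can be sketched
as follows.  This is a minimal illustration and not the libcompat
implementation: struct match_spans and expand_template() are
hypothetical names introduced here, the struct mirrors only the two
documented members (startp and endp), and the caller is assumed to
supply a dest buffer large enough for the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10   /* the value the header is documented to use */

/* Hypothetical stand-in holding only the two documented regexp members. */
struct match_spans {
    const char *startp[NSUBEXP];   /* first character of substring n */
    const char *endp[NSUBEXP];     /* first character after substring n */
};

/*
 * Copy source to dest, replacing `&' with substring 0 and a backslash
 * followed by digit n with substring n.  A backslash before `&' or
 * before another backslash yields that character literally.
 */
void expand_template(const struct match_spans *m, const char *source,
                     char *dest)
{
    const char *src = source;
    char *dst = dest;
    char c;

    assert(m != NULL && source != NULL && dest != NULL);

    while ((c = *src++) != '\0') {
        int n = -1;                 /* substring to copy, or -1 for none */

        if (c == '&') {
            n = 0;
        } else if (c == '\\' && *src >= '0' && *src <= '9') {
            n = *src++ - '0';
        }

        if (n < 0) {
            /* Ordinary character, possibly escaped by a backslash. */
            if (c == '\\' && (*src == '\\' || *src == '&'))
                c = *src++;
            *dst++ = c;
        } else if (m->startp[n] != NULL && m->endp[n] != NULL) {
            /* The substring length is the distance between the pointers. */
            size_t len = (size_t)(m->endp[n] - m->startp[n]);
            memcpy(dst, m->startp[n], len);
            dst += len;
        }
    }
    *dst = '\0';
}
```

With substring 1 set to `hello' and substring 2 set to `world', a
source consisting of a backslash, `2', a space, a backslash, and `1'
expands to `world hello'.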
REGULAR EXPRESSION SYNTAX
A regular expression is zero or more branches, separated by `|'.  It
matches anything that matches one of the branches.

A branch is zero or more pieces, concatenated.  It matches a match for
the first, followed by a match for the second, etc.

A piece is an atom possibly followed by `*', `+', or `?'.  An atom
followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the
atom.  An atom followed by `?' matches a match of the atom, or the null
string.

An atom is a regular expression in parentheses (matching a match for
the regular expression), a range (see below), `.' (matching any single
character), `^' (matching the null string at the beginning of the input
string), `$' (matching the null string at the end of the input string),
a `\' followed by a single character (matching that character), or a
single character with no other significance (matching that character).

A range is a sequence of characters enclosed in `[]'.  It normally
matches any single character from the sequence.  If the sequence begins
with `^', it matches any single character not from the rest of the
sequence.  If two characters in the sequence are separated by `-', this
is shorthand for the full list of ASCII characters between them (e.g.
`[0-9]' matches any decimal digit).  To include a literal `]' in the
sequence, make it the first character (following a possible `^').  To
include a literal `-', make it the first or last character.
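
The range rules above can be illustrated with a small stand-alone
matcher.  This is only a sketch of the documented rules and is not taken
from the library; range_matches() is a hypothetical helper, and it
assumes its argument is a well-formed range that starts with `['.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return true if character c matches the range
 * written in `range', for example "[0-9]" or "[^a-z]".
 */
bool range_matches(const char *range, char c)
{
    const unsigned char *p;
    unsigned char ch = (unsigned char)c;
    bool negate = false;
    bool found = false;
    bool first = true;      /* a `]' in first position is literal */

    assert(range != NULL && range[0] == '[');
    p = (const unsigned char *)range + 1;

    if (*p == '^') {        /* a leading `^' inverts the sense */
        negate = true;
        p++;
    }
    while (*p != '\0' && (first || *p != ']')) {
        if (p[1] == '-' && p[2] != ']' && p[2] != '\0') {
            /* x-y: every character from x through y */
            if (ch >= p[0] && ch <= p[2])
                found = true;
            p += 3;
        } else {
            /* single character, including a first or last `-' */
            if (ch == p[0])
                found = true;
            p++;
        }
        first = false;
    }
    return found != negate;
}
```

Here range_matches("[0-9]", '7') is true, range_matches("[^0-9]", '7')
is false, and both "[]a]" and "[a-]" treat their `]' or `-' as an
ordinary member of the sequence.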
AMBIGUITY
If a regular expression could match two different parts of the input
string, it will match the one which begins earliest.  If both begin in
the same place but match different lengths, or match the same length in
different ways, life gets messier, as follows.

In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost
first.  The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.  If there is more
than one choice, the next will be made in the same manner (earliest
possibility) subject to the decision on the first choice.  And so
forth.

For example, `(ab|a)b*c' could match `abc' in one of two ways.  The
first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.  Since the `b' is
already spoken for, the `b*' must match its last possibility--the empty
string--since it must respect the earlier choice.

In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible match
will be chosen.  So `ab*', presented with `xabbbby', will match
`abbbb'.  Note that if `ab*' is tried against `xabyabbbz', it will
match `ab' just after `x', due to the begins-earliest rule.  (In
effect, the decision on where to start the match is the first choice to
be made, hence subsequent choices must respect it even if this leads
them to less-preferred alternatives.)
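
The two `ab*' cases above can be checked with a short program.  Because
<regexp.h> is often unavailable, this sketch uses the replacement
regex(3) interface from <regex.h> instead; note that its regcomp() and
regexec() share names with the functions documented here but take
different arguments.  It assumes that regex(3) with REG_EXTENDED picks
the same match for these simple patterns; the two interfaces are not
claimed to agree in general.  The first_match() function is a
hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Find the first match of the extended regular expression `pattern' in
 * `text' using regex(3).  On success, store the offsets of the match
 * in *start and *end (end is one past the last matched character) and
 * return 1.  Return 0 if nothing matches or the pattern is rejected.
 */
int first_match(const char *pattern, const char *text,
                size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int found = 0;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, text, 1, m, 0) == 0) {
        *start = (size_t)m[0].rm_so;
        *end = (size_t)m[0].rm_eo;
        found = 1;
    }
    regfree(&re);
    return found;
}
```

For `ab*' against `xabbbby' this reports offsets 1 and 6, which is the
substring `abbbb'.  Against `xabyabbbz' it reports offsets 1 and 3, the
`ab' just after `x', and not the longer `abbb' that begins later.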
RETURN VALUES
The regcomp() function returns NULL for a failure (regerror()
permitting), where failures are syntax errors, exceeding implementation
limits, or applying `+' or `*' to a possibly-null operand.
SEE ALSO
ed(1), egrep(1), ex(1), expr(1), fgrep(1), grep(1), regex(3)
HISTORY
Both code and manual page for regcomp(), regexec(), regsub(), and
regerror() were written at the University of Toronto and appeared in
4.3BSD-Tahoe.  They are intended to be compatible with the Bell V8
regexp(3), but are not derived from Bell code.
BUGS
Empty branches and empty regular expressions are not portable to V8.

The restriction against applying `*' or `+' to a possibly-null operand
is an artifact of the simplistic implementation.

Does not support egrep(1)'s newline-separated branches; neither does
the V8 regexp(3), though.

Due to emphasis on compactness and simplicity, it is not strikingly
fast.  It does give special attention to handling simple cases quickly.

BSD June 4, 1993