recdescent(3)
NAME
Parse::RecDescent - Generate Recursive-Descent Parsers
VERSION
This document describes version 1.80 of Parse::RecDescent,
released January 20, 2001.
SYNOPSIS
    use Parse::RecDescent;

    # Generate a parser from the specification in $grammar:

    $parser = new Parse::RecDescent ($grammar);

    # Generate a parser from the specification in $othergrammar:

    $anotherparser = new Parse::RecDescent ($othergrammar);

    # Parse $text using rule 'startrule' (which must be
    # defined in $grammar):

    $parser->startrule($text);

    # Parse $text using rule 'otherrule' (which must also
    # be defined in $grammar):

    $parser->otherrule($text);

    # Change the universal token prefix pattern
    # (the default is: '\s*'):

    $Parse::RecDescent::skip = '[ \t]+';

    # Replace productions of existing rules (or create new ones)
    # with the productions defined in $newgrammar:

    $parser->Replace($newgrammar);

    # Extend existing rules (or create new ones)
    # by adding extra productions defined in $moregrammar:

    $parser->Extend($moregrammar);

    # Global flags (useful as command line arguments under -s):

    $::RD_ERRORS       # unless undefined, report fatal errors
    $::RD_WARN         # unless undefined, also report non-fatal problems
    $::RD_HINT         # if defined, also suggest remedies
    $::RD_TRACE        # if defined, also trace parsers' behaviour
    $::RD_AUTOSTUB     # if defined, generates "stubs" for undefined rules
    $::RD_AUTOACTION   # if defined, appends specified action to productions
DESCRIPTION
Overview
Parse::RecDescent incrementally generates top-down recursive-descent text
parsers from simple yacc-like grammar specifications. It provides:

    · Regular expressions or literal strings as terminals (tokens),

    · Multiple (non-contiguous) productions for any rule,

    · Repeated and optional subrules within productions,

    · Full access to Perl within actions specified as part of the grammar,

    · Simple automated error reporting during parser generation and parsing,

    · The ability to commit to, uncommit to, or reject particular
      productions during a parse,

    · The ability to pass data up and down the parse tree ("down" via
      subrule argument lists, "up" via subrule return values),

    · Incremental extension of the parsing grammar (even during a parse),

    · Precompilation of parser objects,

    · User-definable reduce-reduce conflict resolution via "scoring" of
      matching productions.
Using Parse::RecDescent

Parser objects are created by calling Parse::RecDescent::new, passing in a
grammar specification (see the following subsections). If the grammar is
correct, "new" returns a blessed reference which can then be used to
initiate parsing through any rule specified in the original grammar. A
typical sequence looks like this:

    $grammar = q {
            # GRAMMAR SPECIFICATION HERE
         };

    $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";

    # acquire $text

    defined $parser->startrule($text) or print "Bad text!\n";

The rule through which parsing is initiated must be explicitly defined in
the grammar (i.e. for the above example, the grammar must include a rule of
the form: "startrule: <subrules>").

If the starting rule succeeds, its value (see below) is returned. Failure
to generate the original parser or failure to match a text is indicated by
returning "undef". Note that it's easy to set up grammars that can succeed,
but which return a value of 0, "0", or "". So don't be tempted to write:

    $parser->startrule($text) or print "Bad text!\n";

Normally, the parser has no effect on the original text. So in the previous
example the value of $text would be unchanged after having been parsed.

If, however, the text to be matched is passed by reference:

    $parser->startrule(\$text)

then any text which was consumed during the match will be removed from the
start of $text.

Rules
In the grammar from which the parser is built, rules are specified by
giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by one or
more productions, separated by single vertical bars. The layout of the
productions is entirely free-format:

    rule1:  production1
         |  production2
            production3
         |  production4

At any point in the grammar previously defined rules may be extended with
additional productions. This is achieved by redeclaring the rule with the
new productions. Thus:

    rule1: a | b | c
    rule2: d | e | f
    rule1: g | h

is exactly equivalent to:

    rule1: a | b | c | g | h
    rule2: d | e | f

Each production in a rule consists of zero or more items, each of which may
be either: the name of another rule to be matched (a "subrule"), a pattern
or string literal to be matched directly (a "token"), a block of Perl code
to be executed (an "action"), a special instruction to the parser (a
"directive"), or a standard Perl comment (which is ignored).

A rule matches a text if one of its productions matches. A production
matches if each of its items match consecutive substrings of the text. The
productions of a rule being matched are tried in the same order that they
appear in the original grammar, and the first matching production
terminates the match attempt (successfully). If all productions are tried
and none matches, the match attempt fails.
Note that this behaviour is quite different from the "prefer the longer
match" behaviour of yacc. For example, if yacc were parsing the rule:

    seq : 'A' 'B'
        | 'A' 'B' 'C'

upon matching "AB" it would look ahead to see if a 'C' is next and, if so,
will match the second production in preference to the first. In other
words, yacc effectively tries all the productions of a rule breadth-first
in parallel, and selects the "best" match, where "best" means longest (note
that this is a gross simplification of the true behaviour of yacc but it
will do for our purposes).

In contrast, Parse::RecDescent tries each production depth-first in
sequence, and selects the "best" match, where "best" means first. This is
the fundamental difference between "bottom-up" and "recursive descent"
parsing.

Each successfully matched item in a production is assigned a value, which
can be accessed in subsequent actions within the same production (or, in
some cases, as the return value of a successful subrule call). Unsuccessful
items don't have an associated value, since the failure of an item causes
the entire surrounding production to immediately fail. The following
sections describe the various types of items and their success values.

Subrules
A subrule which appears in a production is an instruction to the parser to
attempt to match the named rule at that point in the text being parsed. If
the named subrule is not defined when requested the production containing
it immediately fails (unless it was "autostubbed" - see Autostubbing).

A rule may (recursively) call itself as a subrule, but not as the left-most
item in any of its productions (since such recursions are usually
non-terminating).

The value associated with a subrule is the value associated with its
$return variable (see "Actions" below), or with the last successfully
matched item in the subrule match.

Subrules may also be specified with a trailing repetition specifier,
indicating that they are to be (greedily) matched the specified number of
times. The available specifiers are:

    subrule(?)      # Match one-or-zero times
    subrule(s)      # Match one-or-more times
    subrule(s?)     # Match zero-or-more times
    subrule(N)      # Match exactly N times (for integer N > 0)
    subrule(N..M)   # Match between N and M times
    subrule(..M)    # Match between 1 and M times
    subrule(N..)    # Match at least N times
Repeated subrules keep matching until either the subrule fails to match, or
it has matched the minimal number of times but fails to consume any of the
parsed text (this second condition prevents the subrule matching forever in
some cases).

Since a repeated subrule may match many instances of the subrule itself,
the value associated with it is not a simple scalar, but rather a reference
to a list of scalars, each of which is the value associated with one of the
individual subrule matches. In other words, in the rule:

    program: statement(s)

the value associated with the repeated subrule "statement(s)" is a
reference to an array containing the values matched by each call to the
individual subrule "statement".
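Since that value is an array reference, a trailing action can simply
dereference it. The following is a minimal sketch only, re-using the
illustrative "program" and "statement" rules above:

    program: statement(s)
             { # $item[1] is a reference to the array of
               # individual "statement" values
               print "Parsed ", scalar @{$item[1]}, " statements\n";
               $return = $item[1];
             }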
Repetition modifiers may include a separator pattern:

    program: statement(s /;/)

specifying some sequence of characters to be skipped between each
repetition. This is really just a shorthand for the <leftop:...> directive
(see below).

Tokens
If a quote-delimited string or a Perl regex appears in a production, the
parser attempts to match that string or pattern at that point in the text.
For example:

    typedef: "typedef" typename identifier ';'

    identifier: /[A-Za-z_][A-Za-z0-9_]*/

As in regular Perl, a single-quoted string is uninterpolated, whilst a
double-quoted string or a pattern is interpolated (at the time of matching,
not when the parser is constructed). Hence, it is possible to define rules
in which tokens can be set at run-time:

    typedef: "$::typedefkeyword" typename identifier ';'

    identifier: /$::identpat/

Note that, since each rule is implemented inside a special namespace
belonging to its parser, it is necessary to explicitly qualify variables
from the main package.

Regex tokens can be specified using just slashes as delimiters or with the
explicit "m<delimiter>...<delimiter>" syntax:

    typedef: "typedef" typename identifier ';'

    typename: /[A-Za-z_][A-Za-z0-9_]*/

    identifier: m{[A-Za-z_][A-Za-z0-9_]*}

A regex of either type can also have any valid trailing parameter(s) (that
is, any of [cgimsox]):

    typedef: "typedef" typename identifier ';'

    identifier: / [a-z_]        # LEADING ALPHA OR UNDERSCORE
                  [a-z0-9_]*    # THEN DIGITS ALSO ALLOWED
                /ix             # CASE/SPACE/COMMENT INSENSITIVE
The value associated with any successfully matched token is a string
containing the actual text which was matched by the token.

It is important to remember that, since each grammar is specified in a Perl
string, all instances of the universal escape character '\' within a
grammar must be "doubled", so that they interpolate to single '\'s when the
string is compiled. For example, to use the grammar:

    word:       /\S+/ | backslash
    line:       prefix word(s) "\n"
    backslash:  '\\'

the following code is required:

    $parser = new Parse::RecDescent (q{

        word:       /\\S+/ | backslash
        line:       prefix word(s) "\\n"
        backslash:  '\\\\'
    });
Terminal Separators

For the purpose of matching, each terminal in a production is considered to
be preceded by a "prefix" - a pattern which must be matched before a token
match is attempted. By default, the prefix is optional whitespace (which
always matches, at least trivially), but this default may be reset in any
production.

The variable $Parse::RecDescent::skip stores the universal prefix, which is
the default for all terminal matches in all parsers built with
Parse::RecDescent.

The prefix for an individual production can be altered by using the
"<skip:...>" directive (see below).
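As a rough sketch (the pattern and the tiny grammar here are illustrative
only), the universal prefix can be reset before a parser is built, for
example so that newlines are not silently skipped between terminals:

    $Parse::RecDescent::skip = '[ \t]*';    # skip blanks and tabs, but not newlines

    $parser = new Parse::RecDescent (q{
        file: line(s)
        line: /\\S+/ "\\n"
    });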
Actions

An action is a block of Perl code which is to be executed (as the block of
a "do" statement) when the parser reaches that point in a production. The
action executes within a special namespace belonging to the active parser,
so care must be taken in correctly qualifying variable names (see also
"Start-up Actions" below).

The action is considered to succeed if the final value of the block is
defined (that is, if the implied "do" statement evaluates to a defined
value - even one which would be treated as "false"). Note that the value
associated with a successful action is also the final value in the block.

An action will fail if its last evaluated value is "undef". This is
surprisingly easy to accomplish by accident. For instance, here's an
infuriating case of an action that makes its production fail, but only when
debugging isn't activated:

    description: name rank serial_number
                 { print "Got $item[2] $item[1] ($item[3])\n"
                       if $::debugging
                 }

If $::debugging is false, no statement in the block is executed, so the
final value is "undef", and the entire production fails. The solution is:

    description: name rank serial_number
                 { print "Got $item[2] $item[1] ($item[3])\n"
                       if $::debugging;
                   1;
                 }
Within an action, a number of useful parse-time variables are available in
the special parser namespace (there are other variables also accessible,
but meddling with them will probably just break your parser. As a general
rule, if you avoid referring to unqualified variables - especially those
starting with an underscore - inside an action, things should be okay):

@item and %item
The array slice @item[1..$#item] stores the value associated with each item
(that is, each subrule, token, or action) in the current production. The
analogy is to $1, $2, etc. in a yacc grammar. Note that, for obvious
reasons, @item only contains the values of items before the current point
in the production.

The first element ($item[0]) stores the name of the current rule being
matched.

@item is a standard Perl array, so it can also be indexed with negative
numbers, representing the number of items back from the current position in
the parse:

    stuff: /various/ bits 'and' pieces "then" data 'end'
           { print $item[-2] }   # PRINTS data
                                 # (EASIER THAN: $item[6])

The %item hash complements the @item array, providing named access to the
same item values:

    stuff: /various/ bits 'and' pieces "then" data 'end'
           { print $item{data}   # PRINTS data
                                 # (EVEN EASIER THAN USING @item)
           }
The results of named subrules are stored in the hash under each subrule's
name, whilst all other items are stored under a "named positional" key that
indicates their ordinal position within their item type: __STRINGn__,
__PATTERNn__, __DIRECTIVEn__, __ACTIONn__:

    stuff: /various/ bits 'and' pieces "then" data 'end' { save }
           { print $item{__PATTERN1__},   # PRINTS 'various'
                   $item{__STRING2__},    # PRINTS 'then'
                   $item{__ACTION1__},    # PRINTS RETURN
                                          # VALUE OF save
           }

If you want proper named access to patterns or literals, you need to turn
them into separate rules:

    stuff: various bits 'and' pieces "then" data 'end'
           { print $item{various}   # PRINTS various
           }

    various: /various/

The special entry $item{__RULE__} stores the name of the current rule (i.e.
the same value as $item[0]).
The advantage of using %item, instead of @item, is that it removes the need
to track item positions that may change as a grammar evolves. For example,
adding an interim "<skip>" directive or action can silently ruin a trailing
action, by moving an @item element "down" the array one place. In contrast,
the named entry of %item is unaffected by such an insertion.

A limitation of the %item hash is that it only records the last value of a
particular subrule. For example:

    range: '(' number '..' number ')'
           { $return = $item{number} }

will return only the value corresponding to the second match of the
"number" subrule. In other words, successive calls to a subrule overwrite
the corresponding entry in %item. Once again, the solution is to rename
each subrule in its own rule:

    range: '(' from_num '..' to_num ')'
           { $return = $item{from_num} }

    from_num: number
    to_num:   number

@arg and %arg
The array @arg and the hash %arg store any arguments passed to the rule
from some other rule (see "Subrule argument lists"). Changes to the
elements of either variable do not propagate back to the calling rule (data
can be passed back from a subrule via the $return variable - see next
item).
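As a rough sketch of how such arguments reach a subrule (the rule names and
argument are illustrative only): the calling rule supplies an argument list
in square brackets, and the called rule reads it from @arg:

    greeting: salutation['Hello'] salutation['Goodbye']

    salutation: /world/
                { $return = "$arg[0], $item[1]!" }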
$return

If a value is assigned to $return within an action, that value is returned
if the production containing the action eventually matches successfully.
Note that setting $return doesn't cause the current production to succeed.
It merely tells it what to return if it does succeed. Hence $return is
analogous to $$ in a yacc grammar.

If $return is not assigned within a production, the value of the last
component of the production (namely: $item[$#item]) is returned if the
production succeeds.

$commit

The current state of commitment to the current production (see
"Directives" below).

$skip

The current terminal prefix (see "Directives" below).
$text

The remaining (unparsed) text. Changes to $text do not propagate out of
unsuccessful productions, but do survive successful productions. Hence it
is possible to dynamically alter the text being parsed - for example, to
provide a "#include"-like facility:

    hash_include: '#include' filename
                  { $text = ::loadfile($item[2]) . $text }

    filename: '<' /[a-z0-9._-]+/i '>'   { $return = $item[2] }
            | '"' /[a-z0-9._-]+/i '"'   { $return = $item[2] }
$thisline and $prevline

$thisline stores the current line number within the current parse (starting
from 1). $prevline stores the line number for the last character which was
already successfully parsed (this will be different from $thisline at the
end of each line).

For efficiency, $thisline and $prevline are actually tied hashes, and only
recompute the required line number when the variable's value is used.

Assignment to $thisline adjusts the line number calculator, so that it
believes that the current line number is the value being assigned. Note
that this adjustment will be reflected in all subsequent line number
calculations.

Modifying the value of the variable $text (as in the previous
"hash_include" example, for instance) will confuse the line counting
mechanism. To prevent this, you should call
Parse::RecDescent::LineCounter::resync($thisline) immediately after any
assignment to the variable $text (or, at least, before the next attempt to
use $thisline).

Note that if a production fails after assigning to or resync'ing $thisline,
the parser's line counter mechanism will usually be corrupted.

Also see the entry for @itempos.
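For instance, the earlier "hash_include" action could keep the line counter
honest like this (a sketch only; ::loadfile is the same hypothetical helper
used above):

    hash_include: '#include' filename
                  { $text = ::loadfile($item[2]) . $text;
                    Parse::RecDescent::LineCounter::resync($thisline);
                  }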
The line number can be set to values other than 1, by calling the start
rule with a second argument. For example:

    $parser = new Parse::RecDescent ($grammar);

    $parser->input($text, 10);   # START LINE NUMBERS AT 10

$thiscolumn and $prevcolumn
$thiscolumn stores the current column number within the current line being
parsed (starting from 1). $prevcolumn stores the column number of the last
character which was actually successfully parsed. Usually "$prevcolumn ==
$thiscolumn-1", but not at the end of lines.

For efficiency, $thiscolumn and $prevcolumn are actually tied hashes, and
only recompute the required column number when the variable's value is
used.

Assignment to $thiscolumn or $prevcolumn is a fatal error.

Modifying the value of the variable $text (as in the previous
"hash_include" example, for instance) may confuse the column counting
mechanism.

Note that $thiscolumn reports the column number before any whitespace that
might be skipped before reading a token. Hence if you wish to know where a
token started (and ended) use something like this:

    rule: token1 token2 startcol token3 endcol token4
          { print "token3: columns $item[3] to $item[5]"; }

    startcol: // { $thiscolumn }   # NEED THE // TO STEP PAST TOKEN SEP
    endcol:      { $prevcolumn }

Also see the entry for @itempos.
$thisoffset and $prevoffset

$thisoffset stores the offset of the current parsing position within the
complete text being parsed (starting from 0). $prevoffset stores the offset
of the last character which was actually successfully parsed. In all cases
"$prevoffset == $thisoffset-1".

For efficiency, $thisoffset and $prevoffset are actually tied hashes, and
only recompute the required offset when the variable's value is used.

Assignment to $thisoffset or $prevoffset is a fatal error.

Modifying the value of the variable $text will not affect the offset
counting mechanism.

Also see the entry for @itempos.
@itempos

The array @itempos stores a hash reference corresponding to each element of
@item. The elements of the hash provide the following:

    $itempos[$n]{offset}{from}   # VALUE OF $thisoffset BEFORE $item[$n]
    $itempos[$n]{offset}{to}     # VALUE OF $prevoffset AFTER $item[$n]
    $itempos[$n]{line}{from}     # VALUE OF $thisline BEFORE $item[$n]
    $itempos[$n]{line}{to}       # VALUE OF $prevline AFTER $item[$n]
    $itempos[$n]{column}{from}   # VALUE OF $thiscolumn BEFORE $item[$n]
    $itempos[$n]{column}{to}     # VALUE OF $prevcolumn AFTER $item[$n]

Note that the various "$itempos[$n]...{from}" values record the appropriate
value after any token prefix has been skipped.

Hence, instead of the somewhat tedious and error-prone:

    rule: startcol token1 endcol
          startcol token2 endcol
          startcol token3 endcol
          { print "token1: columns $item[1] to $item[3]
                   token2: columns $item[4] to $item[6]
                   token3: columns $item[7] to $item[9]" }

    startcol: // { $thiscolumn }   # NEED THE // TO STEP PAST TOKEN SEP
    endcol:      { $prevcolumn }

it is possible to write:

    rule: token1 token2 token3
          { print "token1: columns $itempos[1]{column}{from}
                           to $itempos[1]{column}{to}
                   token2: columns $itempos[2]{column}{from}
                           to $itempos[2]{column}{to}
                   token3: columns $itempos[3]{column}{from}
                           to $itempos[3]{column}{to}" }

Note however that (in the current implementation) the use of @itempos
anywhere in a grammar implies that item positioning information is
collected everywhere during the parse. Depending on the grammar and the
size of the text to be parsed, this may be prohibitively expensive and the
explicit use of $thisline, $thiscolumn, etc. may be a better choice.

$thisparser
A reference to the Parse::RecDescent object through which parsing was
initiated.

The value of $thisparser propagates down the subrules of a parse but not
back up. Hence, you can invoke subrules from another parser for the scope
of the current rule as follows:

    rule: subrule1 subrule2
        | { $thisparser = $::otherparser } <reject>
        | subrule3 subrule4
        | subrule5

The result is that the production calls "subrule1" and "subrule2" of the
current parser, and the remaining productions call the named subrules from
$::otherparser. Note, however, that "Bad Things" will happen if
"$::otherparser" isn't a blessed reference and/or doesn't have methods with
the same names as the required subrules!
$thisrule

A reference to the Parse::RecDescent::Rule object corresponding to the rule
currently being matched.

$thisprod

A reference to the Parse::RecDescent::Production object corresponding to
the production currently being matched.

$score and $score_return

$score stores the best production score to date, as specified by an earlier
"<score:...>" directive. $score_return stores the corresponding return
value for the successful production.

See "Scored productions".

Warning: the parser relies on the information in the various "this..."
objects in some non-obvious ways. Tinkering with the other members of these
objects will probably cause Bad Things to happen, unless you really know
what you're doing. The only exception to this advice is that the use of
"$this...->{local}" is always safe.

Start-up Actions
Any actions which appear before the first rule definition in a grammar are
treated as "start-up" actions. Each such action is stripped of its
outermost brackets and then evaluated (in the parser's special namespace)
just before the rules of the grammar are first compiled.

The main use of start-up actions is to declare local variables within the
parser's special namespace:

    { my $lastitem = '???'; }

    list: item(s)   { $return = $lastitem }

    item: book      { $lastitem = 'book'; }
        | bell      { $lastitem = 'bell'; }
        | candle    { $lastitem = 'candle'; }

but start-up actions can be used to execute any valid Perl code within a
parser's special namespace.

Start-up actions can appear within a grammar extension or replacement (that
is, a partial grammar installed via Parse::RecDescent::Extend() or
Parse::RecDescent::Replace() - see "Incremental Parsing"), and will be
executed before the new grammar is installed. Note, however, that a
particular start-up action is only ever executed once.
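A rough sketch of that second case (the added "command" rule and its
message are purely illustrative): a start-up action supplied as part of a
grammar extension runs once, when the extension is installed:

    $parser->Extend(q{
        { print "Installing extra productions\\n"; }   # start-up action

        command: 'quit' { exit(0) }
    });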
Autoactions

It is sometimes desirable to be able to specify a default action to be
taken at the end of every production (for example, in order to easily build
a parse tree). If the variable $::RD_AUTOACTION is defined when
Parse::RecDescent::new() is called, the contents of that variable are
treated as a specification of an action which is to be appended to each
production in the corresponding grammar. So, for example, to construct a
simple parse tree:

    $::RD_AUTOACTION = q { [@item] };

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });

which is equivalent to:

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression
                    { [@item] }
                  | and_expr
                    { [@item] }

        and_expr:   not_expr '&&' and_expr
                    { [@item] }
                  | not_expr
                    { [@item] }

        not_expr:   '!' brack_expr
                    { [@item] }
                  | brack_expr
                    { [@item] }

        brack_expr: '(' expression ')'
                    { [@item] }
                  | identifier
                    { [@item] }

        identifier: /[a-z]+/i
                    { [@item] }
    });
Alternatively, we could take an object-oriented approach, using different
classes for each node (and also eliminating redundant intermediate nodes):

    $::RD_AUTOACTION = q{ $#item==1 ? $item[1]
                                    : new ${"$item[0]_node"}(@item[1..$#item]) };

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
    });

which is equivalent to:

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression
                    { new expression_node (@item[1..3]) }
                  | and_expr

        and_expr:   not_expr '&&' and_expr
                    { new and_expr_node (@item[1..3]) }
                  | not_expr

        not_expr:   '!' brack_expr
                    { new not_expr_node (@item[1..2]) }
                  | brack_expr

        brack_expr: '(' expression ')'
                    { new brack_expr_node (@item[1..3]) }
                  | identifier

        identifier: /[a-z]+/i
                    { new identifier_node (@item[1]) }
    });
Note that, if a production already ends in an action, no autoaction is
appended to it. For example, in this version:

    $::RD_AUTOACTION = q{ $#item==1 ? $item[1]
                                    : new ${"$item[0]_node"}(@item[1..$#item]) };

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
                    { new terminal_node($item[1]) }
    });

each "identifier" match produces a "terminal_node" object, not an
"identifier_node" object.

A level 1 warning is issued each time an "autoaction" is added to some
production.
Autotrees

A commonly needed autoaction is one that builds a parse-tree. It is
moderately tricky to set up such an action (which must treat terminals
differently from non-terminals), so Parse::RecDescent simplifies the
process by providing the "<autotree>" directive.

If this directive appears at the start of a grammar, it causes
Parse::RecDescent to insert autoactions at the end of any rule except those
which already end in an action. The action inserted depends on whether the
production is an intermediate rule (two or more items), or a terminal of
the grammar (i.e. a single pattern or string item).

So, for example, the following grammar:

    <autotree>

    file    : command(s)
    command : get | set | vet
    get     : 'get' ident ';'
    set     : 'set' ident 'to' value ';'
    vet     : 'check' ident 'is' value ';'
    ident   : /\w+/
    value   : /\d+/

is equivalent to:

    file    : command(s)                    { bless \%item, $item[0] }
    command : get                           { bless \%item, $item[0] }
            | set                           { bless \%item, $item[0] }
            | vet                           { bless \%item, $item[0] }
    get     : 'get' ident ';'               { bless \%item, $item[0] }
    set     : 'set' ident 'to' value ';'    { bless \%item, $item[0] }
    vet     : 'check' ident 'is' value ';'  { bless \%item, $item[0] }

    ident   : /\w+/  { bless {__VALUE__=>$item[1]}, $item[0] }
    value   : /\d+/  { bless {__VALUE__=>$item[1]}, $item[0] }
Note that each node in the tree is blessed into a class of the same name as
the rule itself. This makes it easy to build object-oriented processors for
the parse-trees that the grammar produces. Note too that the last two rules
produce special objects with the single attribute '__VALUE__'. This is
because they consist solely of a single terminal.

This autoaction-ed grammar would then produce a parse tree in a data
structure like this:

    { file => {
          command => [
              get => { identifier => { __VALUE__ => 'a' },
                     },
              set => { identifier => { __VALUE__ => 'b' },
                       value      => { __VALUE__ => '7' },
                     },
              vet => { identifier => { __VALUE__ => 'b' },
                       value      => { __VALUE__ => '7' },
                     },
          ],
      },
    }

(except, of course, that each nested hash would also be blessed into the
appropriate class).
Autostubbing

Normally, if a subrule appears in some production, but no rule of that name
is ever defined in the grammar, the production which refers to the
non-existent subrule fails immediately. This typically occurs as a result
of misspellings, and is a sufficiently common occurrence that a warning is
generated for such situations.

However, when prototyping a grammar it is sometimes useful to be able to
use subrules before a proper specification of them is really possible. For
example, a grammar might include a section like:

    function_call: identifier '(' arg(s?) ')'

    identifier: /[a-z]\w*/i

where the possible format of an argument is sufficiently complex that it is
not worth specifying in full until the general function call syntax has
been debugged. In this situation it is convenient to leave the real rule
"arg" undefined and just slip in a placeholder (or "stub"):

    arg: 'arg'

so that the function call syntax can be tested with dummy input such as:

    f0()
    f1(arg)
    f2(arg arg)
    f3(arg arg arg)

et cetera.
Early in prototyping, many such "stubs" may be required, so
Parse::RecDescent provides a means of automating their definition. If the
variable $::RD_AUTOSTUB is defined when a parser is built, a subrule
reference to any non-existent rule (say, "sr"), causes a "stub" rule of the
form:

    sr: 'sr'

to be automatically defined in the generated parser. A level 1 warning is
issued for each such "autostubbed" rule.

Hence, with $::RD_AUTOSTUB defined, it is possible to only partially
specify a grammar, and then "fake" matches of the unspecified (sub)rules by
just typing in their name.
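For example (a sketch only, re-using the partial grammar above):

    $::RD_AUTOSTUB = 1;                       # autostub any undefined rules

    $parser = new Parse::RecDescent (q{
        function_call: identifier '(' arg(s?) ')'
        identifier:    /[a-z]\\w*/i
    });                                       # the stub "arg: 'arg'" is generated automatically

    $parser->function_call("f2(arg arg)");    # matches, thanks to the stub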
Look-ahead

If a subrule, token, or action is prefixed by "...", then it is treated as
a "look-ahead" request. That means that the current production can (as
usual) only succeed if the specified item is matched, but that the matching
does not consume any of the text being parsed. This is very similar to the
"/(?=...)/" look-ahead construct in Perl patterns. Thus, the rule:

    inner_word: word ...word

will match whatever the subrule "word" matches, provided that match is
followed by some more text which subrule "word" would also match (although
this second substring is not actually consumed by "inner_word").

Likewise, a "...!" prefix causes the following item to succeed (without
consuming any text) if and only if it would normally fail. Hence, a rule
such as:

    identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/

matches a string of characters which satisfies the pattern /[A-Za-z_]\w*/,
but only if the same sequence of characters would not match either subrule
"keyword" or the literal token '_'.

Sequences of look-ahead prefixes accumulate, multiplying their positive
and/or negative senses. Hence:

    inner_word: word ...!...!word

is exactly equivalent to the original example above (a warning is issued in
cases like these, since they often indicate something left out, or
misunderstood).

Note that actions can also be treated as look-aheads. In such cases, the
state of the parser text (in the local variable $text) after the look-ahead
action is guaranteed to be identical to its state before the action,
regardless of how it's changed within the action (unless you actually
undefine $text, in which case you get the disaster you deserve :-).
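A minimal sketch of an action used as a look-ahead (the rule names and the
test are illustrative only): the first production only proceeds to match
"number" if the upcoming text begins with a digit, and because of the "..."
prefix $text is restored regardless of what the action does to it:

    value: ...{ $text =~ /^\s*\d/ ? 1 : undef } number
         | word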
Directives

Directives are special pre-defined actions which may be used to alter the
behaviour of the parser. There are currently eighteen directives:
"<commit>", "<uncommit>", "<reject>", "<score>", "<autoscore>", "<skip>",
"<resync>", "<error>", "<rulevar>", "<matchrule>", "<leftop>", "<rightop>",
"<defer>", "<nocheck>", "<perl_quotelike>", "<perl_codeblock>",
"<perl_variable>", and "<token>".

Committing and uncommitting
The "<commit>" and "<uncommit>" directives permit the recursive descent of
the parse tree to be pruned (or "cut") for efficiency. Within a rule, a
"<commit>" directive instructs the rule to ignore subsequent productions if
the current production fails. For example:

    command: 'find' <commit> filename
           | 'open' <commit> filename
           | 'move' filename filename

Clearly, if the leading token 'find' is matched in the first production but
that production fails for some other reason, then the remaining productions
cannot possibly match. The presence of the "<commit>" causes the "command"
rule to fail immediately if an invalid "find" command is found, and
likewise if an invalid "open" command is encountered.

It is also possible to revoke a previous commitment. For example:

    if_statement: 'if' <commit> condition
                  'then' block <uncommit>
                  'else' block
                | 'if' <commit> condition
                  'then' block

In this case, a failure to find an "else" block in the first production
shouldn't preclude trying the second production, but a failure to find a
"condition" certainly should.

As a special case, any production in which the first item is an
"<uncommit>" immediately revokes a preceding "<commit>" (even though the
production would not otherwise have been tried). For example, in the rule:

    request: 'explain' expression
           | 'explain' <commit> keyword
           | 'save'
           | 'quit'
           | <uncommit> term '?'

if the text being matched was "explain?", and the first two productions
failed, then the "<commit>" in production two would cause productions three
and four to be skipped, but the leading "<uncommit>" in production five
would allow that production to attempt a match.

Note in the preceding example, that the "<commit>" was only placed in
production two. If production one had been:

    request: 'explain' <commit> expression

then production two would be (inappropriately) skipped if a leading
"explain..." was encountered.

Both "<commit>" and "<uncommit>" directives always succeed, and their value
is always 1.

Rejecting a production
The "<reject>" directive immediately causes the current production to fail
(it is exactly equivalent to, but more obvious than, the action "{undef}").
A "<reject>" is useful when it is desirable to get the side effects of the
actions in one production, without prejudicing a match by some other
production later in the rule. For example, to insert tracing code into the
parse:

    complex_rule: { print "In complex rule...\n"; } <reject>

    complex_rule: simple_rule '+' 'i' '*' simple_rule
                | 'i' '*' simple_rule
                | simple_rule

It is also possible to specify a conditional rejection, using the form
"<reject:condition>", which only rejects if the specified condition is
true. This form of rejection is exactly equivalent to the action
"{(condition)?undef:1}". For example:

    command: save_command
           | restore_command
           | <reject: defined $::tolerant> { exit }
           | <error: Unknown command. Ignored.>

A "<reject>" directive never succeeds (and hence has no associated value).
A conditional rejection may succeed (if its condition is not satisfied), in
which case its value is 1.

As an extra optimization, Parse::RecDescent ignores any production which
begins with an unconditional "<reject>" directive, since any such
production can never successfully match or have any useful side-effects. A
level 1 warning is issued in all such cases.

Note that productions beginning with conditional "<reject:...>" directives
are never "optimized away" in this manner, even if they are always
guaranteed to fail (for example: "<reject:1>").

Due to the way grammars are parsed, there is a minor restriction on the
condition of a conditional "<reject:...>": it cannot contain any raw '<' or
'>' characters. For example:

    line: cmd <reject: $thiscolumn > max> data

results in an error when a parser is built from this grammar (since the
grammar parser has no way of knowing whether the first > is a "greater
than" or the end of the "<reject:...>").

To overcome this problem, put the condition inside a do{} block:

    line: cmd <reject: do{$thiscolumn > max}> data

Note that the same problem may occur in other directives that take
arguments. The same solution will work in all cases.

Skipping between terminals
The "<skip>" directive enables the terminal prefix used in a production to
be changed. For example:

    OneLiner: Command <skip:'[ \t]*'> Arg(s) /;/

causes only blanks and tabs to be skipped before terminals in the "Arg"
subrule (and any of its subrules), and also before the final "/;/"
terminal. Once the production is complete, the previous terminal prefix is
reinstated. Note that this implies that distinct productions of a rule must
reset their terminal prefixes individually.

The "<skip>" directive evaluates to the previous terminal prefix, so it's
easy to reinstate a prefix later in a production:

    Command: <skip:","> CSV(s) <skip:$item[1]> Modifier

The value specified after the colon is interpolated into a pattern, so all
of the following are equivalent (though their efficiency increases down the
list):

    <skip: "$colon|$comma">   # ASSUMING THE VARS HOLD THE OBVIOUS VALUES

    <skip: ':|,'>

    <skip: q{[:,]}>

    <skip: qr/[:,]/>

There is no way of directly setting the prefix for an entire rule, except
as follows:

    Rule: <skip: '[ \t]*'> Prod1
        | <skip: '[ \t]*'> Prod2a Prod2b
        | <skip: '[ \t]*'> Prod3

or, better:

    Rule: <skip: '[ \t]*'>
        ( Prod1
        | Prod2a Prod2b
        | Prod3
        )

Note: Up to release 1.51 of Parse::RecDescent, an entirely different
mechanism was used for specifying terminal prefixes. The current method is
not backwards-compatible with that early approach. The current approach is
stable and will not change again.
Resynchronization

The "<resync>" directive provides a visually distinctive means of consuming
some of the text being parsed, usually to skip an erroneous input. In its
simplest form "<resync>" simply consumes text up to and including the next
newline ("\n") character, succeeding only if the newline is found, in which
case it causes its surrounding rule to return zero on success.

In other words, a "<resync>" is exactly equivalent to the token
"/[^\n]*\n/" followed by the action "{ $return = 0 }" (except that
productions beginning with a "<resync>" are ignored when generating error
messages). A typical use might be:

    script : command(s)

    command: save_command
           | restore_command
           | <resync> # TRY NEXT LINE, IF POSSIBLE

It is also possible to explicitly specify a resynchronization pattern,
using the "<resync:pattern>" variant. This version succeeds only if the
specified pattern matches (and consumes) the parsed text. In other words,
"<resync:pattern>" is exactly equivalent to the token "/pattern/" (followed
by a "{ $return = 0 }" action). For example, if commands were terminated by
newlines or semi-colons:

    command: save_command
           | restore_command
           | <resync:[^;\n]*[;\n]>

The value of a successfully matched "<resync>" directive (of either type)
is the text that it consumed. Note, however, that since the directive also
sets $return, a production consisting of a lone "<resync>" succeeds but
returns the value zero (which a calling rule may find useful to distinguish
between "true" matches and "tolerant" matches). Remember that returning a
zero value indicates that the rule succeeded (since only an "undef" denotes
failure within Parse::RecDescent parsers).

Error handling
The "<error>" directive provides automatic or user-defined generation of
error messages during a parse. In its simplest form "<error>" prepares an
error message based on the mismatch between the last item expected and the
text which caused it to fail. For example, given the rule:

    McCoy: curse ',' name ", I'm a doctor, not a" a_profession '!'
         | pronoun "dead," name '!'
         | <error>

the following strings would produce the following messages:

    "Amen, Jim!"

        ERROR (line 1): Invalid McCoy: Expected curse or pronoun
                        not found

    "Dammit, Jim, I'm a doctor!"

        ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a"
                        but found ", I'm a doctor!" instead

    "He's dead,\n"

        ERROR (line 2): Invalid McCoy: Expected name not found

    "He's alive!"

        ERROR (line 1): Invalid McCoy: Expected 'dead,' but found
                        "alive!" instead

    "Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!"

        ERROR (line 1): Invalid McCoy: Expected a profession but found
                        "pointy-eared Vulcan!" instead
Note that, when autogenerating error messages, all underscores in any rule
name used in a message are replaced by single spaces (for example
"a_production" becomes "a production"). Judicious choice of rule names can
therefore considerably improve the readability of automatic error messages
(as well as the maintainability of the original grammar).

If the automatically generated error is not sufficient, it is possible to
provide an explicit message as part of the error directive. For example:

    Spock: "Fascinating" ',' (name | 'Captain') '.'
         | "Highly illogical, doctor."
         | <error: He never said that!>

which would result in all failures to parse a "Spock" subrule printing the
following message:

    ERROR (line <N>): Invalid Spock: He never said that!

The error message is treated as a "qq{...}" string and interpolated when
the error is generated (not when the directive is specified!). Hence:

    <error: Mystical error near "$text">

would correctly insert the ambient text string which caused the error.

There are two other forms of error directive: "<error?>" and "<error?:
msg>". These behave just like "<error>" and "<error: msg>" respectively,
except that they are only triggered if the rule is "committed" at the time
they are encountered. For example:

    Scotty: "Ya kenna change the Laws o' Phusics," <commit> name
          | name <commit> ',' "she's goanta blaw!"
          | <error?>

will only generate an error for a string beginning with "Ya kenna change
the Laws o' Phusics," or a valid name, but which still fails to match the
corresponding production. That is, "$parser->Scotty("Aye, Cap'ain")" will
fail silently (since neither production will "commit" the rule on that
input), whereas "$parser->Scotty("Mr Spock, ah jest kenna do'ut!")" will
fail with the error message:

    ERROR (line 1): Invalid Scotty: expected "she's goanta blaw!"
                    but found "ah jest kenna do'ut!" instead.

since in that case the second production would commit after matching the
leading name.
Note that to allow this behaviour, all "<error>" directives which are the
first item in a production automatically uncommit the rule just long enough
to allow their production to be attempted (that is, when their production
fails, the commitment is reinstated so that subsequent productions are
skipped).

In order to permanently uncommit the rule before an error message, it is
necessary to put an explicit "<uncommit>" before the "<error>". For
example:

    line: 'Kirk:'  <commit> Kirk
        | 'Spock:' <commit> Spock
        | 'McCoy:' <commit> McCoy
        | <uncommit> <error?> <reject>
        | <resync>

Error messages generated by the various "<error...>" directives are not
displayed immediately. Instead, they are "queued" in a buffer and are only
displayed once parsing ultimately fails. Moreover, "<error...>" directives
that cause one production of a rule to fail are automatically removed from
the message queue if another production subsequently causes the entire rule
to succeed. This means that you can put "<error...>" directives wherever
useful diagnosis can be done, and only those associated with actual parser
failure will ever be displayed. Also see "Gotchas".
usually generated either at the very lowest level
within the grammar, or at the very highest. A good
rule of thumb is to identify those subrules which con
sist mainly (or entirely) of terminals, and then put
an "<error...>" directive at the end of any other rule
which calls one or more of those subrules. - There is one other situation in which the output of
the various types of error directive is suppressed;
namely, when the rule containing them is being parsed
as part of a "look-ahead" (see "Look-ahead"). In this
case, the error directive will still cause the rule to
fail, but will do so silently. - An unconditional "<error>" directive always fails (and
hence has no associated value). This means that
encountering such a directive always causes the
production containing it to fail. Hence an "<error>"
directive will inevitably be the last (useful) item of
a rule (a level 3 warning is issued if a production
contains items after an unconditional "<error>" direc
tive). - An "<error?>" directive will succeed (that is: fail to
fail :-), if the current rule is uncommitted when the
directive is encountered. In that case the directive's
associated value is zero. Hence, this type of error
directive can be used before the end of a production.
For example:
command: 'do' <commit> something| 'report' <commit> something
| <error?: Syntax error> <error: Unknown command>- Warning: The "<error?>" directive does not mean
"always fail (but do so silently unless committed)".
It actually means "only fail (and report) if commit
ted, otherwise succeed". To achieve the "fail silently if uncommitted" semantics, it is necessary to use:
rule: item <commit> item(s)| <error?> <reject> # FAIL SILENTLYUNLESS COMMITTED- However, because people seem to expect a lone
"<error?>" directive to work like this:
rule: item <commit> item(s)| <error?: Error message if committed>
| <error: Error message if uncommitted>- Parse::RecDescent automatically appends a "<reject>"
directive if the "<error?>" directive is the only item
in a production. A level 2 warning (see below) is
issued when this happens. - The level of error reporting during both parser con
The level of error reporting during both parser construction and parsing is
controlled by the presence or absence of four global variables:
$::RD_ERRORS, $::RD_WARN, $::RD_HINT, and $::RD_TRACE. If $::RD_ERRORS is
defined (and, by default, it is) then fatal errors are reported.

Whenever $::RD_WARN is defined, certain non-fatal problems are also
reported. Warnings have an associated "level": 1, 2, or 3. The higher the
level, the more serious the warning. The value of the corresponding global
variable ($::RD_WARN) determines the lowest level of warning to be
displayed. Hence, to see all warnings, set $::RD_WARN to 1. To see only the
most serious warnings set $::RD_WARN to 3. By default $::RD_WARN is
initialized to 3, ensuring that serious but non-fatal errors are
automatically reported.

See "DIAGNOSTICS" for a list of the various error and warning messages that
Parse::RecDescent generates when these two variables are defined.

Defining any of the remaining variables (which are not defined by default)
further increases the amount of information reported. Defining $::RD_HINT
causes the parser generator to offer more detailed analyses and hints on
both errors and warnings. Note that setting $::RD_HINT at any point
automagically sets $::RD_WARN to 1.

Defining $::RD_TRACE causes the parser generator and the parser to report
their progress to STDERR in excruciating detail (although, without hints
unless $::RD_HINT is separately defined). This detail can be moderated in
only one respect: if $::RD_TRACE has an integer value (N) greater than 1,
only the first N characters of the "current parsing context" (that is,
where in the input string we are at any point in the parse) are reported at
any time.

$::RD_TRACE is mainly useful for debugging a grammar that isn't behaving as
you expected it to. To this end, if $::RD_TRACE is defined when a parser is
built, any actual parser code which is generated is also written to a file
named "RD_TRACE" in the local directory.
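For instance (a sketch only), a script that builds a parser might switch on
the full set of diagnostics while its grammar is being developed:

    $::RD_WARN  = 1;    # report warnings of every level
    $::RD_HINT  = 1;    # and offer hints about how to fix them
    $::RD_TRACE = 40;   # trace the parse, showing 40 characters of context

    $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";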
Note that the four variables belong to the "main" package, which makes them
easier to refer to in the code controlling the parser, and also makes it
easy to turn them into command line flags ("-RD_ERRORS", "-RD_WARN",
"-RD_HINT", "-RD_TRACE") under perl -s.
Specifying local variables

It is occasionally convenient to specify variables which are local to a
single rule. This may be achieved by including a "<rulevar:...>" directive
anywhere in the rule. For example:

    markup: <rulevar: $tag>

    markup: tag {($tag=$item[1]) =~ s/^<|>$//g} body[$tag]

The example "<rulevar: $tag>" directive causes a "my" variable named $tag
to be declared at the start of the subroutine implementing the "markup"
rule (that is, before the first production, regardless of where in the rule
it is specified).

Specifically, any directive of the form: "<rulevar:text>" causes a line of
the form "my text;" to be added at the beginning of the rule subroutine,
immediately after the definitions of the following local variables:

    $thisparser     $commit
    $thisrule       @item
    $thisline       @arg
    $text           %arg

This means that the following "<rulevar>" directives work as expected:

    <rulevar: $count = 0 >

    <rulevar: $firstarg = $arg[0] || '' >

    <rulevar: $myItems = @item >

    <rulevar: @context = ( $thisline, $text, @arg ) >

    <rulevar: ($name,$age) = @arg{"name","age"} >

If a variable that is also visible to subrules is required, it needs to be
"local"'d, not "my"'d. "rulevar" defaults to "my", but if "local" is
explicitly specified:

    <rulevar: local $count = 0 >

then a "local"-ized variable is declared instead, and will be available
within subrules.

Note however that, because all such variables are "my" variables, their
values do not persist between match attempts on a given rule. To preserve
values between match attempts, values can be stored within the "local"
member of the $thisrule object:

    countedrule: { $thisrule->{"local"}{"count"}++ }
                 <reject>
               | subrule1
               | subrule2
               | <reject: $thisrule->{"local"}{"count"} == 1>
                 subrule3

When matching a rule, each "<rulevar>" directive is matched as if it were
an unconditional "<reject>" directive (that is, it causes any production in
which it appears to immediately fail to match). For this reason (and to
improve readability) it is usual to specify any "<rulevar>" directive in a
separate production at the start of the rule (this has the added advantage
that it enables Parse::RecDescent to optimize away such productions, just
as it does for the "<reject>" directive).

Dynamically matched rules
Because regexes and double-quoted strings are interpolated, it is
relatively easy to specify productions with "context sensitive" tokens. For
example:

    command: keyword body "end $item[1]"

which ensures that a command block is bounded by a
"<keyword>...end <same keyword>" pair.

Building productions in which subrules are context sensitive is also
possible, via the "<matchrule:...>" directive. This directive behaves
identically to a subrule item, except that the rule which is invoked to
match it is determined by the string specified after the colon. For
example, we could rewrite the "command" rule like this:

    command: keyword <matchrule:body> "end $item[1]"

Whatever appears after the colon in the directive is treated as an
interpolated string (that is, as if it appeared in a "qq{...}" operator)
and the value of that interpolated string is the name of the subrule to be
matched.

Of course, just putting a constant string like "body" in a
"<matchrule:...>" directive is of little interest or benefit. The power of
the directive is seen when we use a string that interpolates to something
interesting. For example:

    command: keyword <matchrule:$item[1]_body> "end $item[1]"

    keyword: 'while' | 'if' | 'function'

    while_body: condition block

    if_body: condition block ('else' block)(?)

    function_body: arglist block

Now the "command" rule selects how to proceed on the basis of the keyword
that is found. It is as if "command" were declared:

    command: 'while'    while_body    "end while"
           | 'if'       if_body       "end if"
           | 'function' function_body "end function"

When a "<matchrule:...>" directive is used as a repeated subrule, the rule
name expression is "late-bound". That is, the name of the rule to be called
is re-evaluated each time a match attempt is made. Hence, the following
grammar:

    { $::species = 'dogs' }

    pair: 'two' <matchrule:$::species>(s)

    dogs: /dogs/ { $::species = 'cats' }

    cats: /cats/

will match the string "two dogs cats cats" completely, whereas it will only
match the string "two dogs dogs dogs" up to the eighth letter. If the rule
name were "early bound" (that is, evaluated only the first time the
directive is encountered in a production), the reverse behaviour would be
expected.

Deferred actions
The "<defer:...>" directive is used to specify an action to be performed
when (and only if!) the current production ultimately succeeds.

Whenever a "<defer:...>" directive appears, the code it specifies is
converted to a closure (an anonymous subroutine reference) which is queued
within the active parser object. Note that, because the deferred code is
converted to a closure, the values of any "local" variables (such as $text,
@item, etc.) are preserved until the deferred code is actually executed.

If the parse ultimately succeeds and the production in which the
"<defer:...>" directive was evaluated formed part of the successful parse,
then the deferred code is executed immediately before the parse returns. If
however the production which queued a deferred action fails, or one of the
higher-level rules which called that production fails, then the deferred
action is removed from the queue, and hence is never executed.

For example, given the grammar:

    sentence: noun trans noun
            | noun intrans

    noun:    'the dog'
                 { print "$item[1]\t(noun)\n" }
           | 'the meat'
                 { print "$item[1]\t(noun)\n" }

    trans:   'ate'
                 { print "$item[1]\t(transitive)\n" }

    intrans: 'ate'
                 { print "$item[1]\t(intransitive)\n" }
           | 'barked'
                 { print "$item[1]\t(intransitive)\n" }

then parsing the sentence "the dog ate" would produce the output:

    the dog   (noun)
    ate       (transitive)
    the dog   (noun)
    ate       (intransitive)

This is because, even though the first production of "sentence" ultimately
fails, its initial subrules "noun" and "trans" do match, and hence they
execute their associated actions. Then the second production of "sentence"
succeeds, causing the actions of the subrules "noun" and "intrans" to be
executed as well.

On the other hand, if the actions were replaced by "<defer:...>"
directives:
    sentence: noun trans noun
            | noun intrans

    noun:    'the dog'
                 <defer: print "$item[1]\t(noun)\n" >
           | 'the meat'
                 <defer: print "$item[1]\t(noun)\n" >

    trans:   'ate'
                 <defer: print "$item[1]\t(transitive)\n" >

    intrans: 'ate'
                 <defer: print "$item[1]\t(intransitive)\n" >
           | 'barked'
                 <defer: print "$item[1]\t(intransitive)\n" >

the output would be:

    the dog   (noun)
    ate       (intransitive)

since deferred actions are only executed if they were evaluated in a
production which ultimately contributes to the successful parse.

In this case, even though the first production of "sentence" caused the
subrules "noun" and "trans" to match, that production ultimately failed and
so the deferred actions queued by those subrules were subsequently
discarded. The second production then succeeded, causing the entire parse
to succeed, and so the deferred actions queued by the (second) match of the
"noun" subrule and the subsequent match of "intrans" are preserved and
eventually executed.

Deferred actions provide a means of improving the performance of a parser,
by only executing those actions which are part of the final parse-tree for
the input data.

Alternatively, deferred actions can be viewed as a mechanism for building
(and executing) a customized subroutine corresponding to the given input
data, much in the same way that autoactions (see "Autoactions") can be used
to build a customized data structure for specific input.

Whether or not the action it specifies is ever executed, a "<defer:...>"
directive always succeeds, returning the number of deferred actions
currently queued at that point.

Parsing Perl
- Parse::RecDescent provides limited support for parsing
subsets of Perl, namely: quote-like operators, Perl
variables, and complete code blocks. - The "<perl_quotelike>" directive can be used to parse
any Perl quote-like operator: 'a string', "m/a pat
tern/", "tr{ans}{lation}", etc. It does this by call
ing Text::Balanced::quotelike(). - If a quote-like operator is found, a reference to an
array of eight elements is returned. Those elements
are identical to the last eight elements returned by
Text::Balanced::extract_quotelike() in an array context,
namely:

        [0] the name of the quotelike operator -- 'q', 'qq',
            'm', 's', 'tr' -- if the operator was named;
            otherwise "undef",

        [1] the left delimiter of the first block of the
            operation,

        [2] the text of the first block of the operation
            (that is, the contents of a quote, the regex of a
            match or substitution, or the target list of a
            translation),

        [3] the right delimiter of the first block of the
            operation,

        [4] the left delimiter of the second block of the
            operation if there is one (that is, if it is a
            "s", "tr", or "y"); otherwise "undef",

        [5] the text of the second block of the operation if
            there is one (that is, the replacement of a
            substitution or the translation list of a
            translation); otherwise "undef",

        [6] the right delimiter of the second block of the
            operation (if any); otherwise "undef",

        [7] the trailing modifiers on the operation (if any);
            otherwise "undef".
- If a quote-like expression is not found, the directive
fails with the usual "undef" value.
The "<perl_variable>" directive can be used to parse
any Perl variable: $scalar, @array, %hash,
$ref->{field}[$index], etc. It does this by calling
Text::Balanced::extract_variable().
If the directive matches text representing a valid
Perl variable specification, it returns that text.
Otherwise it fails with the usual "undef" value.
The "<perl_codeblock>" directive can be used to parse a
curly-brace-delimited block of Perl code, such as: {
$a = 1; f() =~ m/pat/; }. It does this by calling
Text::Balanced::extract_codeblock().
If the directive matches text representing a valid
Perl code block, it returns that text. Otherwise it
fails with the usual "undef" value.
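For example, the two directives can be combined to recognise
a simple foreach-like construct (a sketch only; the rule name
and the structure of the return value are illustrative):

        loop: 'foreach' <perl_variable> <perl_codeblock>
                { $return = { var => $item[2], body => $item[3] } }

which would match text such as: foreach $x { print $x }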
Constructing tokens
Eventually, Parse::RecDescent will be able to parse
tokenized input, as well as ordinary strings. In
preparation for this joyous day, the "<token:...>"
directive has been provided. This directive creates a
token which will be suitable for input to a
Parse::RecDescent parser (when it eventually supports
tokenized input). - The text of the token is the value of the immediately
preceding item in the production. A "<token:...>"
directive always succeeds with a return value which is
the hash reference that is the new token. It also sets
the return value for the production to that hash ref. - The "<token:...>" directive makes it easy to build a
Parse::RecDescent-compatible lexer in Parse::RecDescent:

        my $lexer = new Parse::RecDescent q
        {
        lex:    token(s)

        token:  /a\b/                       <token:INDEF>
             |  /the\b/                     <token:DEF>
             |  /fly\b/                     <token:NOUN,VERB>
             |  /[a-z]+/i { lc $item[1] }   <token:ALPHA>
             |  <error: Unknown token>
        };

which will eventually be able to be used with a regular
Parse::RecDescent grammar:

        my $parser = new Parse::RecDescent q
        {
        startrule: subrule1 subrule2

        # ETC...
        };

either with a pre-lexing phase:

        $parser->startrule( $lexer->lex($data) );

or with a lex-on-demand approach:

        $parser->startrule( sub{$lexer->token($data)} );

But at present, only the "<token:...>" directive is
actually implemented. The rest is vapourware.
Specifying operations
- One of the commonest requirements when building a
parser is to specify binary operators. Unfortunately,
in a normal grammar, the rules for such things are
awkward:
        disjunction: conjunction ('or' conjunction)(s?)
                { $return = [ $item[1], @{$item[2]} ] }

        conjunction: atom ('and' atom)(s?)
                { $return = [ $item[1], @{$item[2]} ] }

or inefficient:

        disjunction: conjunction 'or' disjunction
                { $return = [ $item[1], @{$item[3]} ] }
            | conjunction
                { $return = [ $item[1] ] }

        conjunction: atom 'and' conjunction
                { $return = [ $item[1], @{$item[3]} ] }
            | atom
                { $return = [ $item[1] ] }

and either way is ugly and hard to get right.
- The "<leftop:...>" and "<rightop:...>" directives pro
vide an easier way of specifying such operations.
Using "<leftop:...>" the above examples become:
        disjunction: <leftop: conjunction 'or' conjunction>

        conjunction: <leftop: atom 'and' atom>

The "<leftop:...>" directive specifies a left-associative
binary operator. It is specified around three
other grammar elements (typically subrules or termi
nals), which match the left operand, the operator
itself, and the right operand respectively.
A "<leftop:...>" directive such as:

        disjunction: <leftop: conjunction 'or' conjunction>

is converted to the following:

        disjunction: ( conjunction ('or' conjunction)(s?)
                { $return = [ $item[1], @{$item[2]} ] } )
- In other words, a "<leftop:...>" directive matches the
left operand followed by zero or more repetitions of
both the operator and the right operand. It then flat
tens the matched items into an anonymous array which
becomes the (single) value of the entire
"<leftop:...>" directive. - For example, an "<leftop:...>" directive such as:
output: <leftop: ident '<<' expr >- when given a string such as:
cout << var << "str" << 3- would match, and $item[1] would be set to:
[ 'cout', 'var', '"str"', '3' ]- In other words:
output: <leftop: ident '<<' expr >- is equivalent to a left-associative operator:
output: ident- { $return = [$item[1]] }
| ident '<<' expr{ $return = [@item[1,3]] }
| ident '<<' expr '<<' expr{ $return = [@item[1,3,5]] }
| ident '<<' expr '<<' expr '<<' expr{ $return = [@item[1,3,5,7]] }
# ...etc... - Similarly, the "<rightop:...>" directive takes a left
operand, an operator, and a right operand:
        assign: <rightop: var '=' expr >

and converts them to:

        assign: ( (var '=' {$return=$item[1]})(s?) expr
                { $return = [ @{$item[1]}, $item[2] ] } )

which is equivalent to a right-associative operator:

        assign: var                                    { $return = [$item[1]] }
              | var '=' expr                           { $return = [@item[1,3]] }
              | var '=' var '=' expr                   { $return = [@item[1,3,5]] }
              | var '=' var '=' var '=' expr           { $return = [@item[1,3,5,7]] }
              # ...etc...

Note that for both the "<leftop:...>" and
"<rightop:...>" directives, the directive does not
normally return the operator itself, just a list of
the operands involved. This is particularly handy for
specifying lists:
        list: '(' <leftop: list_item ',' list_item> ')'
                { $return = $item[2] }
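The flattened array can also be processed directly in an
action. For example (a sketch, not taken from the original
examples), a left-associative sum of integers might be
written:

        sum: <leftop: number '+' number>
                { my $total = 0;
                  $total += $_ for @{$item[1]};
                  $return = $total;
                }

        number: /\d+/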
- There is, however, a problem: sometimes the operator
is itself significant. For example, in a Perl list a
comma and a "=>" are both valid separators, but the
"=>" has additional stringification semantics. Hence
it's important to know which was used in each case. - To solve this problem the "<leftop:...>" and
"<rightop:...>" directives do return the operator(s)
as well, under two circumstances. The first case is
where the operator is specified as a subrule. In that
instance, whatever the operator matches is returned
(on the assumption that if the operator is important
enough to have its own subrule, then it's important
enough to return). - The second case is where the operator is specified as
a regular expression. In that case, if the first
bracketed subpattern of the regular expression
matches, that matching value is returned (this is
analogous to the behaviour of the Perl "split" func
tion, except that only the first subpattern is
returned). - In other words, given the input:
        ( a=>1, b=>2 )

the specifications:

        list: '(' <leftop: list_item separator list_item> ')'

        separator: ',' | '=>'

or:

        list: '(' <leftop: list_item /(,|=>)/ list_item> ')'

cause the list separators to be interleaved with the
operands in the anonymous array in $item[2]:

        [ 'a', '=>', '1', ',', 'b', '=>', '2' ]

But the following version:

        list: '(' <leftop: list_item /,|=>/ list_item> ')'

returns only the operands:

        [ 'a', '1', 'b', '2' ]

Of course, none of the above specifications handle the
case of an empty list, since the "<leftop:...>" and
"<rightop:...>" directives require at least a single
right or left operand to match. To specify that the
operator can match "trivially", it's necessary to add
a "(?)" qualifier to the directive:
        list: '(' <leftop: list_item /(,|=>)/ list_item>(?) ')'
- Note that in almost all the above examples, the first
and third arguments of the "<leftop:...>" directive
were the same subrule. That is because
"<leftop:...>"'s are frequently used to specify "sepa
rated" lists of the same type of item. To make such
lists easier to specify, the following syntax:
        list: element(s /,/)

is exactly equivalent to:

        list: <leftop: element /,/ element>

Note that the separator must be specified as a raw pattern
(i.e. not a string or subrule).
Scored productions
- By default, Parse::RecDescent grammar rules always
accept the first production that matches the input.
But if two or more productions may potentially match
the same input, choosing the first that does so may
not be optimal. - For example, if you were parsing the sentence "time
flies like an arrow", you might use a rule like this:
        sentence: verb noun preposition article noun        { [@item] }
                | adjective noun verb article noun          { [@item] }
                | noun verb preposition article noun        { [@item] }

Each of these productions matches the sentence, but
the third one is the most likely interpretation. How
ever, if the sentence had been "fruit flies like a
banana", then the second production is probably the
right match.
To cater for such situations, the "<score:...>" directive
can be used. The directive is equivalent to an uncondi
tional "<reject>", except that it allows you to spec
ify a "score" for the current production. If that
score is numerically greater than the best score of
any preceding production, the current production is
cached for later consideration. If no later production
matches, then the cached production is treated as hav
ing matched, and the value of the item immediately
before its "<score:...>" directive is returned as the
result. - In other words, by putting a "<score:...>" directive
at the end of each production, you can select which
production matches using criteria other than specifi
cation order. For example:
        sentence: verb noun preposition article noun
                        { [@item] } <score: sensible(@item)>
                | adjective noun verb article noun
                        { [@item] } <score: sensible(@item)>
                | noun verb preposition article noun
                        { [@item] } <score: sensible(@item)>

Now, when each production reaches its respective
"<score:...>" directive, the subroutine "sensible"
will be called to evaluate the matched items (some
how). Once all productions have been tried, the one
which "sensible" scored most highly will be the one
that is accepted as a match for the rule. - The variable $score always holds the current best
score of any production, and the variable
$score_return holds the corresponding return value. - As another example, the following grammar matches
lines that may be separated by commas, colons, or
semi-colons. This can be tricky if a colon-separated
line also contains commas, or vice versa. The grammar
resolves the ambiguity by selecting the rule that
results in the fewest fields:
        line: seplist[sep=>','] <score: -@{$item[1]}>
            | seplist[sep=>':'] <score: -@{$item[1]}>
            | seplist[sep=>" "] <score: -@{$item[1]}>

        seplist: <skip:"">
                 <leftop: /[^$arg{sep}]*/ "$arg{sep}" /[^$arg{sep}]*/>
- Note the use of negation within the "<score:...>"
directive to ensure that the seplist with the most
items gets the lowest score. - As the above examples indicate, it is often the case
that all productions in a rule use exactly the same
"<score:...>" directive. It is tedious to have to
repeat this identical directive in every production,
so Parse::RecDescent also provides the
"<autoscore:...>" directive. - If an "<autoscore:...>" directive appears in any pro
duction of a rule, the code it specifies is used as
the scoring code for every production of that rule,
except productions that already end with an explicit
"<score:...>" directive. Thus the rules above could be
rewritten:
        line: <autoscore: -@{$item[1]}>

        line: seplist[sep=>',']
            | seplist[sep=>':']
            | seplist[sep=>" "]

        sentence: <autoscore: sensible(@item)>
                | verb noun preposition article noun        { [@item] }
                | adjective noun verb article noun          { [@item] }
                | noun verb preposition article noun        { [@item] }

Note that the "<autoscore:...>" directive itself acts
as an unconditional "<reject>", and (like the
"<rulevar:...>" directive) is pruned at compile-time
wherever possible.
Dispensing with grammar checks
- During the compilation phase of parser construction,
Parse::RecDescent performs a small number of checks on
the grammar it's given. Specifically it checks that
the grammar is not left-recursive, that there are no
"insatiable" constructs of the form:
        rule: subrule(s) subrule

and that there are no rules missing (i.e. referred to, but
never defined).
These checks are important during development, but can
slow down parser construction in stable code. So
Parse::RecDescent provides the <nocheck> directive to
turn them off. The directive can only appear before
the first rule definition, and switches off checking
throughout the rest of the current grammar.
Typically, this directive would be added when a parser has
been thoroughly tested and is ready for release.
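For example (a minimal sketch; the grammar itself is
illustrative), the directive is simply placed at the top of
an already-debugged grammar:

        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
            <nocheck>

            startrule: item(s)

            item: /[a-z]+/i
        });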
Subrule argument lists
It is occasionally useful to pass data to a subrule which
is being invoked. For example, consider the following
grammar fragment:
        classdecl: keyword decl

        keyword: 'struct' | 'class';

        decl: # WHATEVER
- The "decl" rule might wish to know which of the two key
words was used (since it may affect some aspect of the way
the subsequent declaration is interpreted). "Parse::RecDe
scent" allows the grammar designer to pass data into a
rule, by placing that data in an argument list (that is, in square brackets) immediately after any subrule item in
a production. Hence, we could pass the keyword to "decl"
as follows:
        classdecl: keyword decl[ $item[1] ]

        keyword: 'struct' | 'class';

        decl: # WHATEVER
- The argument list can consist of any number (including
zero!) of comma-separated Perl expressions. In other
words, it looks exactly like a Perl anonymous array refer
ence. For example, we could pass the keyword, the name of
the surrounding rule, and the literal 'keyword' to "decl"
like so:
        classdecl: keyword decl[$item[1],$item[0],'keyword']

        keyword: 'struct' | 'class';

        decl: # WHATEVER
- Within the rule to which the data is passed ("decl" in the
above examples) that data is available as the elements of
a local variable @arg. Hence "decl" might report its
intentions as follows:
        classdecl: keyword decl[$item[1],$item[0],'keyword']

        keyword: 'struct' | 'class';

        decl: { print "Declaring $arg[0] (a $arg[2])\n";
                print "(this rule called by $arg[1])" }
- Subrule argument lists can also be interpreted as hashes,
simply by using the local variable %arg instead of @arg.
Hence we could rewrite the previous example:
        classdecl: keyword decl[keyword => $item[1],
                                caller  => $item[0],
                                type    => 'keyword']

        keyword: 'struct' | 'class';

        decl: { print "Declaring $arg{keyword} (a $arg{type})\n";
                print "(this rule called by $arg{caller})" }
- Both @arg and %arg are always available, so the grammar
designer may choose whichever convention (or combination
of conventions) suits best. - Subrule argument lists are also useful for creating "rule
templates" (especially when used in conjunction with the
"<matchrule:...>" directive). For example, the subrule:
        list: <matchrule:$arg{rule}> /$arg{sep}/ list[%arg]
                { $return = [ $item[1], @{$item[3]} ] }
            | <matchrule:$arg{rule}>
                { $return = [ $item[1] ] }
- is a handy template for the common problem of matching a
separated list. For example:
        function: 'func' name '(' list[rule=>'param',sep=>';'] ')'

        param: list[rule=>'name',sep=>','] ':' typename

        name: /\w+/

        typename: name
- When a subrule argument list is used with a repeated sub
rule, the argument list goes before the repetition
specifier:
        list: /some|many/ thing[ $item[1] ](s)

The argument list is "late bound". That is, it is
re-evaluated for every repetition of the repeated subrule. This
means that each repeated attempt to match the subrule may
be passed a completely different set of arguments if the
value of the expression in the argument list changes
between attempts. So, for example, the grammar:
        { $::species = 'dogs' }

        pair: 'two' animal[$::species](s)

        animal: /$arg[0]/ { $::species = 'cats' }
- will match the string "two dogs cats cats" completely,
whereas it will only match the string "two dogs dogs dogs"
up to the eighth letter. If the value of the argument list
were "early bound" (that is, evaluated only the first time
a repeated subrule match is attempted), one would expect
the matching behaviours to be reversed. - Of course, it is possible to effectively "early bind" such
argument lists by passing them a value which does not
change on each repetition. For example:
        { $::species = 'dogs' }

        pair: 'two' { $::species } animal[$item[2]](s)

        animal: /$arg[0]/ { $::species = 'cats' }
- Arguments can also be passed to the start rule, simply by
appending them to the argument list with which the start
rule is called (after the "line number" parameter). For
example, given:
        $parser = new Parse::RecDescent ( $grammar );

        $parser->data($text, 1, "str", 2, @arr);
        #             ^^^^^  ^  ^^^^^^^^^^^^^^
        #               |    |         |
        #               |    |    ELEMENTS OF @arg WHICH IS PASSED TO RULE data
        #               |    STARTING LINE NUMBER
        #               TEXT TO BE PARSED

then within the productions of the rule "data", the array
@arg will contain ("str", 2, @arr).
Alternations
- Alternations are implicit (unnamed) rules defined as part
of a production. An alternation is defined as a series of
'|'-separated productions inside a pair of round brackets.
For example:
        character: 'the' ( good | bad | ugly ) /dude/

Every alternation implicitly defines a new subrule, whose
automatically-generated name indicates its origin:
"_alternation_<I>_of_production_<P>_of_rule<R>" for the
appropriate values of <I>, <P>, and <R>. A call to this
implicit subrule is then inserted in place of the
brackets. Hence the above example is merely a convenient
short-hand for:
        character: 'the' _alternation_1_of_production_1_of_rule_character /dude/

        _alternation_1_of_production_1_of_rule_character:
                good | bad | ugly
- Since alternations are parsed by recursively calling the
parser generator, any type(s) of item can appear in an
alternation. For example:
character: 'the' ( 'high' "plains" # Silent,- with poncho
| /no[- ]name/ # Silent,no poncho
| vengeance_seeking # Ponchooptional
| <error>
) drifter - In this case, if an error occurred, the automatically gen
erated message would be:
        ERROR (line <N>): Invalid implicit subrule: Expected
                          'high' or /no[- ]name/ or generic,
                          but found "pacifist" instead

Since every alternation actually has a name, it's even
possible to extend or replace them:
        $parser->Replace(
            "_alternation_1_of_production_1_of_rule_character: 'generic Eastwood'"
        );

More importantly, since alternations are a form of
subrule, they can be given repetition specifiers:
        character: 'the' ( good | bad | ugly )(?) /dude/

Incremental Parsing
- "Parse::RecDescent" provides two methods - "Extend" and
"Replace" - which can be used to alter the grammar matched
by a parser. Both methods take the same argument as
"Parse::RecDescent::new", namely a grammar specification
string - "Parse::RecDescent::Extend" interprets the grammar speci
fication and adds any productions it finds to the end of
the rules for which they are specified. For example:
$add = "name: 'Jimmy-Bob' | 'Bobby-Jim'0esc:- colour /necks?/";
parser->Extend($add); - adds two productions to the rule "name" (creating it if
necessary) and one production to the rule "desc". - "Parse::RecDescent::Replace" is identical, except that it
first resets are rule specified in the additional grammar,
removing any existing productions. Hence after:
$add = "name: 'Jimmy-Bob' | 'Bobby-Jim'0esc:- colour /necks?/";
parser->Replace($add); - are are only valid "name"s and the one possible
description. - A more interesting use of the "Extend" and "Replace" meth
ods is to call them inside the action of an executing
parser. For example:
        typedef: 'typedef' type_name identifier ';'
                    { $thisparser->Extend("type_name: '$item[3]'") }
               | <error>

        identifier: ...!type_name /[A-Za-z_]\w*/
- which automatically prevents type names from being type
def'd, or:
        command: 'map' key_name 'to' abort_key
                    { $thisparser->Replace("abort_key: '$item[2]'") }
               | 'map' key_name 'to' key_name
                    { map_key($item[2],$item[4]) }
               | abort_key
                    { exit if confirm("abort?") }

        abort_key: 'q'

        key_name: ...!abort_key /[A-Za-z]/
- which allows the user to change the abort key binding, but
not to unbind it. - The careful use of such constructs makes it possible to
reconfigure a running parser, eliminating the need for
semantic feedback by providing syntactic feedback instead.
However, as currently implemented, "Replace()" and
"Extend()" have to regenerate and re-"eval" the entire
parser whenever they are called. This makes them quite
slow for large grammars. - In such cases, the judicious use of an interpolated regex
is likely to be far more efficient:
        typedef: 'typedef' type_name identifier ';'
                    { $thisparser->{local}{type_name} .= "|$item[3]" }
               | <error>

        identifier: ...!type_name /[A-Za-z_]\w*/

        type_name: /$thisparser->{local}{type_name}/
- Precompiling parsers
- Normally Parse::RecDescent builds a parser from a grammar
at run-time. That approach simplifies the design and
implementation of parsing code, but has the disadvantage
that it slows the parsing process down - you have to wait
for Parse::RecDescent to build the parser every time the
program runs. Long or complex grammars can be particularly
slow to build, leading to unacceptable delays at start-up. - To overcome this, the module provides a way of "pre-build
ing" a parser object and saving it in a separate module.
That module can then be used to create clones of the orig
inal parser. - A grammar may be precompiled using the "Precompile" class
method. For example, to precompile a grammar stored in
the scalar $grammar, and produce a class named PreGrammar
in a module file named PreGrammar.pm, you could use:
        use Parse::RecDescent;

        Parse::RecDescent->Precompile($grammar, "PreGrammar");
- The first argument is the grammar string, the second is
the name of the class to be built. The name of the module
file is generated automatically by appending ".pm" to the
last element of the class name. Thus
        Parse::RecDescent->Precompile($grammar, "My::New::Parser");
- would produce a module file named Parser.pm.
- It is somewhat tedious to have to write a small Perl pro
gram just to generate a precompiled grammar class, so
Parse::RecDescent has some special magic that allows you
to do the job directly from the command-line.
If your grammar is specified in a file named grammar, you
can generate a class named Yet::Another::Grammar like so:

        > perl -MParse::RecDescent - grammar Yet::Another::Grammar
- This would produce a file named Grammar.pm containing the
full definition of a class called Yet::Another::Grammar.
Of course, to use that class, you would need to put the
Grammar.pm file in a directory named Yet/Another, somewhere
in your Perl include path.
Having created the new class, it's very easy to use it to
build a parser. You simply "use" the new module, and then
call its "new" method to create a parser object. For exam
ple:
        use Yet::Another::Grammar;

        my $parser = Yet::Another::Grammar->new();

The effect of these two lines is exactly the same as:

        use Parse::RecDescent;

        open GRAMMAR_FILE, "grammar" or die;
        local $/;
        my $grammar = <GRAMMAR_FILE>;

        my $parser = Parse::RecDescent->new($grammar);
- only considerably faster.
- Note however that the parsers produced by either approach
are exactly the same, so whilst precompilation has an
effect on set-up speed, it has no effect on parsing speed.
RecDescent 2.0 will address that problem.
A Metagrammar for "Parse::RecDescent"
- The following is a specification of grammar format
accepted by "Parse::RecDescent::new" (specified in the
"Parse::RecDescent" grammar format!):
        grammar    : component(s)

        component  : rule | comment

        rule       : "\n" identifier ":" production(s?)

        production : item(s)

        item       : lookahead(?) simpleitem
                   | directive
                   | comment

        lookahead  : '...' | '...!'                      # +'ve or -'ve lookahead

        simpleitem : subrule args(?)                     # match another rule
                   | repetition                          # match repeated subrules
                   | terminal                            # match the next input
                   | bracket args(?)                     # match alternative items
                   | action                              # do something

        subrule    : identifier                          # the name of the rule

        args       : {extract_codeblock($text,'[]')}     # just like a [...] array ref

        repetition : subrule args(?) howoften

        howoften   : '(?)'                               # 0 or 1 times
                   | '(s?)'                              # 0 or more times
                   | '(s)'                               # 1 or more times
                   | /(\d+)[.][.](\d+)/                  # $1 to $2 times
                   | /[.][.](\d*)/                       # at most $1 times
                   | /(\d*)[.][.]/                       # at least $1 times

        terminal   : /[/]([\\][/]|[^/])*[/]/             # interpolated pattern
                   | /"([\\]"|[^"])*"/                   # interpolated literal
                   | /'([\\]'|[^'])*'/                   # uninterpolated literal

        action     : { extract_codeblock($text) }        # embedded Perl code

        bracket    : '(' item(s) production(s?) ')'      # alternative subrules

        directive  : '<commit>'                          # commit to production
                   | '<uncommit>'                        # cancel commitment
                   | '<resync>'                          # skip to newline
                   | '<resync:' pattern '>'              # skip <pattern>
                   | '<reject>'                          # fail this production
                   | '<reject:' condition '>'            # fail if <condition>
                   | '<error>'                           # report an error
                   | '<error:' string '>'                # report error as "<string>"
                   | '<error?>'                          # error only if committed
                   | '<error?:' string '>'               #   "    "     "    "
                   | '<rulevar:' /[^>]+/ '>'             # define rule-local variable
                   | '<matchrule:' string '>'            # invoke rule named in string

        identifier : /[a-z]\w*/i

        comment    : /#[^\n]*/                           # same as Perl

        pattern    : {extract_bracketed($text,'<')}      # allow embedded "<..>"

        condition  : {extract_codeblock($text,'{<')}     # full Perl expression

        string     : {extract_variable($text)}           # any Perl variable
                   | {extract_quotelike($text)}          #   or quotelike string
                   | {extract_bracketed($text,'<')}      #   or balanced brackets
GOTCHAS
This section describes common mistakes that grammar writ
ers seem to make on a regular basis.
1. Expecting an error to always invalidate a parse
- A common mistake when using error messages is to write the
grammar like this:

        file: line(s)

        line: line_type_1
            | line_type_2
            | line_type_3
            | <error>

The expectation seems to be that any line that is not of
type 1, 2 or 3 will invoke the "<error>" directive and
thereby cause the parse to fail. - Unfortunately, that only happens if the error occurs in
the very first line. The first rule states that a "file"
is matched by one or more lines, so if even a single line
succeeds, the first rule is completely satisfied and the
parse as a whole succeeds. That means that any error mes
sages generated by subsequent failures in the "line" rule
are quietly ignored.
Typically what's really needed is this:

        file: line(s) eofile    { $return = $item[1] }

        line: line_type_1
            | line_type_2
            | line_type_3
            | <error>

        eofile: /^\Z/
- The addition of the "eofile" subrule to the first produc
tion means that a file only matches a series of successful
"line" matches that consume the complete input text. If any input text remains after the lines are matched, there
must have been an error in the last "line". In that case
the "eofile" rule will fail, causing the entire "file"
rule to fail too.
Note too that "eofile" must match "/^\Z/" (end-of-text).
And don't forget the action at the end of the production.
If you just write:
        file: line(s) eofile

then the value returned by the "file" rule will be the
value of its last item: "eofile". Since "eofile" always
returns an empty string on success, that will cause the
"file" rule to return that empty string. Apart from
returning the wrong value, returning an empty string will
trip up code such as:
        $parser->file($filetext) || die;

(since "" is false).
- Remember that Parse::RecDescent returns undef on failure,
so the only safe test for failure is:
defined($parser->file($filetext)) || die;
DIAGNOSTICS
Diagnostics are intended to be self-explanatory (particu
larly if you use -RD_HINT (under perl -s) or define
$::RD_HINT inside the program).
"Parse::RecDescent" currently diagnoses the following:
- · Invalid regular expressions used as pattern terminals
- (fatal error).
- · Invalid Perl code in code blocks (fatal error).
- · Lookahead used in the wrong place or in a nonsensical
- way (fatal error).
- · "Obvious" cases of left-recursion (fatal error).
- · Missing or extra components in a "<leftop>" or
- "<rightop>" directive.
- · Unrecognisable components in the grammar specification
- (fatal error).
- · "Orphaned" rule components specified before the first
- rule (fatal error) or after an "<error>" directive
(level 3 warning). - · Missing rule definitions (this only generates a level
- 3 warning, since you may be providing them later via
"Parse::RecDescent::Extend()"). - · Instances where greedy repetition behaviour will
- almost certainly cause the failure of a production (a
level 3 warning - see "ON-GOING ISSUES AND FUTURE
DIRECTIONS" below). - · Attempts to define rules named 'Replace' or 'Extend',
- which cannot be called directly through the parser
object because of the predefined meaning of
"Parse::RecDescent::Replace" and "Parse::RecDes
cent::Extend". (Only a level 2 warning is generated,
since such rules can still be used as subrules). - · Productions which consist of a single "<error?>"
- directive, and which therefore may succeed unexpect
edly (a level 2 warning, since this might conceivably
be the desired effect). - · Multiple consecutive lookahead specifiers (a level 1
- warning only, since their effects simply accumulate).
- · Productions which start with a "<reject>" or
- "<rulevar:...>" directive. Such productions are optimized
away (a level 1 warning). - · Rules which are autogenerated under $::AUTOSTUB (a
- level 1 warning).
AUTHOR
Damian Conway (damian@conway.org)
BUGS AND IRRITATIONS
There are undoubtedly serious bugs lurking somewhere in
this much code :-) Bug reports and other feedback are most
welcome.
Ongoing annoyances include:
- · There's no support for parsing directly from an input
- stream. If and when the Perl Gods give us regular
expressions on streams, this should be trivial (ahem!)
to implement. - · The parser generator can get confused if actions
- aren't properly closed or if they contain particularly
nasty Perl syntax errors (especially unmatched curly
brackets). - · The generator only detects the most obvious form of
- left recursion (potential recursion on the first sub
rule in a rule). More subtle forms of left recursion
(for example, through the second item in a rule after
a "zero" match of a preceding "zero-or-more" repeti
tion, or after a match of a subrule with an empty pro
duction) are not found. - · Instead of complaining about left-recursion, the gen
- erator should silently transform the grammar to remove
it. Don't expect this feature any time soon as it
would require a more sophisticated approach to parser
generation than is currently used. - · The generated parsers don't always run as fast as
- might be wished.
- · The meta-parser should be bootstrapped using
- "Parse::RecDescent" :-)
ON-GOING ISSUES AND FUTURE DIRECTIONS
- 1. Repetitions are "incorrigibly greedy" in that they
- will eat everything they can and won't backtrack if
that behaviour causes a production to fail needlessly.
So, for example:
        rule: subrule(s) subrule

will never succeed, because the repetition will eat
all the subrules it finds, leaving none to match the
second item. Such constructions are relatively rare
(and "Parse::RecDescent::new" generates a warning
whenever they occur) so this may not be a problem,
especially since the insatiable behaviour can be over
come "manually" by writing:
        rule: penultimate_subrule(s) subrule

        penultimate_subrule: subrule ...subrule
- The issue is that this construction is exactly twice
as expensive as the original, whereas backtracking
would add only 1/N to the cost (for matching N repeti
tions of "subrule"). I would welcome feedback on the
need for backtracking; particularly on cases where the
lack of it makes parsing performance problematical. - 2. Having opened that can of worms, it's also necessary
to consider whether there is a need for non-greedy
repetition specifiers. Again, it's possible (at some
cost) to manually provide the required functionality:
        rule: nongreedy_subrule(s) othersubrule

        nongreedy_subrule: subrule ...!othersubrule
- Overall, the issue is whether the benefit of this
extra functionality outweighs the drawbacks of further
complicating the (currently minimalist) grammar speci
fication syntax, and (worse) introducing more overhead
into the generated parsers. - 3. An "<autocommit>" directive would be nice. That is, it
would be useful to be able to say:
        command: <autocommit>

        command: 'find' name
               | 'find' address
               | 'do' command 'at' time 'if' condition
               | 'do' command 'at' time
               | 'do' command
               | unusual_command

and have the generator work out that this should be
"pruned" thus:
        command: 'find' name
               | 'find' <commit> address
               | 'do' <commit> command <uncommit>
                        'at' time
                        'if' <commit> condition
               | 'do' <commit> command <uncommit>
                        'at' <commit> time
               | 'do' <commit> command
               | unusual_command

There are several issues here. Firstly, should the
"<autocommit>" automatically install an "<uncommit>"
at the start of the last production (on the grounds
that the "command" rule doesn't know whether an
"unusual_command" might start with "find" or "do") or
should the "unusual_command" subgraph be analysed (to
see if it might be viable after a "find" or "do")? - The second issue is how regular expressions should be
treated. The simplest approach would be simply to
uncommit before them (on the grounds that they might
match). Better efficiency would be obtained by analyz
ing all preceding literal tokens to determine whether
the pattern would match them. - Overall, the issues are: can such automated "pruning"
approach a hand-tuned version sufficiently closely to
warrant the extra set-up expense, and (more impor
tantly) is the problem important enough to even war
rant the non-trivial effort of building an automated
solution?
COPYRIGHT
- Copyright (c) 1997-2000, Damian Conway. All Rights
Reserved. This module is free software. It may be used,
redistributed and/or modified under the terms of the Perl
Artistic License - (see http://www.perl.com/perl/misc/Artistic.html)