twig(3)

NAME

XML::Twig - A perl module for processing huge XML documents
in tree mode.

SYNOPSIS

Small documents
  my $twig=XML::Twig->new();    # create the twig
  $twig->parsefile( 'doc.xml'); # build it
  my_process( $twig);           # use twig methods to process it
  $twig->print;                 # output the twig
Huge documents
  my $twig=XML::Twig->new(
    twig_handlers =>
      { title   => sub { $_->set_gi( 'h2') }, # change title tags to h2
        para    => sub { $_->set_gi( 'p')  }, # change para to p
        hidden  => sub { $_->delete;       }, # remove hidden elements
        list    => \&my_list_process,         # process list elements
        div     => sub { $_[0]->flush;     }, # output and free memory
      },
    pretty_print => 'indented',               # output will be nicely formatted
    empty_tags   => 'html',                   # outputs <empty_tag />
                         );
    $twig->parsefile( 'doc.xml');             # parse, calling the handlers
    $twig->flush;                             # flush the end of the document
See XML::Twig 101 for other ways to use the module, as a
filter for example.

DESCRIPTION

This module provides a way to process XML documents. It is
built on top of XML::Parser.

The module offers a tree interface to the document, while
allowing you to output the parts of it that have been
completely processed.

It allows minimal resource (CPU and memory) usage by
building the tree only for the parts of the documents that
need actual processing, through the use of the
"twig_roots" and "twig_print_outside_roots" options. The
"finish" and "finish_print" methods also help to increase
performance.

XML::Twig tries to make simple things easy, so it tries its
best to take care of a lot of the (usually) annoying (but
sometimes necessary) features that come with XML and
XML::Parser.

XML::Twig 101

XML::Twig can be used either on "small" XML documents
(that fit in memory) or on huge ones, by processing parts
of the document and outputting or discarding them once
they are processed.

Loading an XML document and processing it
  my $t= XML::Twig->new();
  $t->parse( '<d><tit>title</tit><para>para1</para><para>p2</para></d>');
  my $root= $t->root;
  $root->set_gi( 'html');                  # change doc to html
  my $title= $root->first_child( 'tit');   # get the title
  $title->set_gi( 'h1');                   # turn it into h1
  my @para= $root->children( 'para');      # get the para children
  foreach my $para (@para)
    { $para->set_gi( 'p'); }               # turn them into p
  $t->print;                               # output the document
Other useful methods include:
att: "$elt->{'att'}->{'type'}" returns the "type"
attribute for an element,
set_att: "$elt->set_att( type => "important")" sets the
"type" attribute to the "important" value,
next_sibling: "$elt->{next_sibling}" returns the next
sibling in the document (in the example
"$title->{next_sibling}" is the first "para"), while
"$elt->next_sibling( 'table')" is the next "table" sibling.
The document can also be transformed through the use of
the cut, copy, paste and move methods: "$title->cut;
$title->paste( 'after', $p);" for example.
And much, much more, see Elt.
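The methods above can be combined as in the following
minimal sketch. It reuses the sample document from this
section and assumes that "sprint" (the string-returning
counterpart of "print") is available; the variable names
are only illustrative.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new();
$t->parse( '<d><tit>title</tit><para>para1</para><para>p2</para></d>');

my $root=  $t->root;
my $title= $root->first_child( 'tit');       # the title element
$title->set_att( type => 'important');       # add an attribute

my $first_para= $title->next_sibling;        # first para, right after the title
$title->cut;                                 # detach the title from the tree...
$title->paste( 'after', $first_para);        # ...and re-attach it after the para

my $out= $root->sprint;                      # serialize to a string
print "$out\n";
```

The title ends up between the two para elements, carrying
its new "type" attribute.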
Processing an XML document chunk by chunk
One of the strengths of XML::Twig is that it lets you work
with files that do not fit in memory (BTW storing an XML
document in memory as a tree is quite memory-expensive,
the expansion factor being often around 10).
To do this you can define handlers that will be called
once a specific element has been completely parsed. In
these handlers you can access the element and process it
as you see fit, using the navigation and the cut-n-paste
methods, plus lots of convenient ones like "prefix". Once
the element is completely processed you can then "flush"
it, which will output it and free the memory. You can also
"purge" it if you don't need to output it (if you are just
extracting some data from the document for example). The
handler will be called again once the next relevant
element has been parsed.

  my $t= XML::Twig->new( twig_handlers =>
                           { section => \&section,
                             para    => sub { $_->set_gi( 'p'); },
                           },
                       );
  $t->parsefile( 'doc.xml');
  $t->flush; # don't forget to flush one last time in the end or
             # anything after the last </section> tag will not be output

  # the handler is called once a section is completely parsed, ie when
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section
    { my( $t, $section)= @_;                      # arguments for all twig_handlers
      $section->set_gi( 'div');                   # change the gi, my favourite method...
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->{'att'}->{'nb'};            # get the attribute
      $title->prefix( "$nb - ");                  # easy isn't it?
      $section->flush;                            # outputs the section and frees memory
    }

  my $t= XML::Twig->new( twig_handlers =>
                           { 'section/title' => \&print_elt_text } );
  $t->parsefile( 'doc.xml');
  sub print_elt_text
    { my( $t, $elt)= @_;
      print $elt->text;
    }

  my $t= XML::Twig->new( twig_handlers =>
                           { 'section[@level="1"]' => \&print_elt_text }
                       );
  $t->parsefile( 'doc.xml');
There is of course more to it: you can trigger handlers on
more elaborate conditions than just the name of the
element, "section/title" for example. You can also use
"start_tag_handlers" to process an element as soon as the
start tag is found. Besides "prefix" you can also use
"suffix".
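When the goal is only to extract data, "purge" can be used
in place of "flush". The sketch below is one possible way
to do it; the "list" and "item" element names and the
inline document are made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @names;                                   # data extracted from the document

my $t= XML::Twig->new(
         twig_handlers =>
           { item => sub { my( $twig, $item)= @_;
                           push @names, $item->text;  # keep only what we need
                           $twig->purge;              # free what has been parsed so far
                         },
           },
                     );
$t->parse( '<list><item>apple</item><item>pear</item><item>plum</item></list>');

print join( ', ', @names), "\n";
```

Nothing is output by the twig itself here: each item is
discarded as soon as its text has been collected.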
Processing just parts of an XML document
The twig_roots mode builds only the required sub-trees
from the document. Anything outside of the twig roots will
just be ignored:

  my $t= XML::Twig->new(
           # the twig will include just the root and selected titles
           twig_roots => { 'section/title' => \&print_elt_text,
                           'annex/title'   => \&print_elt_text
                         }
                       );
  $t->parsefile( 'doc.xml');
  sub print_elt_text
    { my( $t, $elt)= @_;
      print $elt->text;    # print the text (including sub-element texts)
      $t->purge;           # frees the memory
    }
You can use that mode when you want to process parts of a
document but are not interested in the rest and you don't
want to pay the price, either in time or memory, to build
the tree for it.
Building an XML filter
You can combine the twig_roots and the
twig_print_outside_roots options to build filters, which
let you modify selected elements and will output the rest
of the document as is.
This would convert prices in $ to prices in Euro in a
document:

  my $t= XML::Twig->new(
           twig_roots               => { 'price' => \&convert }, # process prices
           twig_print_outside_roots => 1,                        # print the rest
                       );
  $t->parsefile( 'doc.xml');
  sub convert
    { my( $t, $price)= @_;
      my $currency= $price->{'att'}->{'currency'};      # get the currency
      if( $currency eq 'USD')
        { my $usd_price= $price->text;                  # get the price
          # %rate is just a conversion table
          my $euro_price= $usd_price * $rate{usd2euro};
          $price->set_text( $euro_price);               # set the new price
          $price->set_att( currency => 'EUR');          # don't forget this!
        }
      $price->print;                                    # output the price
    }
Simplifying XML processing
Whitespaces
Whitespaces that look non-significant are discarded;
this behaviour can be controlled using the
"keep_spaces", "keep_spaces_in" and "discard_spaces_in"
options.
Encoding
You can specify that you want the output in the same
encoding as the input (provided you have valid XML,
which means you have to specify the encoding either in
the document or when you create the Twig object) using
the "keep_encoding" option.
Comments and Processing Instructions (PI)
Comments and PI's can be hidden from the processing,
but still appear in the output (they are carried by
the "real" element closest to them).
Pretty Printing
XML::Twig can output the document pretty printed so it
is easier to read for us humans.
Surviving an untimely death
XML parsers are supposed to react violently when fed
improper XML. XML::Parser just dies.
XML::Twig provides the "safe_parse" and the
"safe_parsefile" methods which wrap the parse in an
eval and return either the parsed twig or 0 in case of
failure.
Private attributes
Attributes with a name starting with # (illegal in
XML) will not be output, so you can safely use them to
store temporary values during processing.
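Two of the features above, "safe_parse" and private
attributes, can be sketched as follows. The documents and
the "#seen" attribute name are invented for the example,
and "sprint" is assumed to be available to get the output
as a string.

```perl
use strict;
use warnings;
use XML::Twig;

# safe_parse does not die on improper XML: it returns a false value
# and leaves the error message in $@
my $bad= XML::Twig->new();
my $ok=  $bad->safe_parse( '<doc><p>not closed</doc>');
my $error= $ok ? '' : $@;
print "parse failed: $error\n" unless $ok;

# an attribute whose name starts with # is kept in the twig
# but left out of the output
my $good= XML::Twig->new();
$good->safe_parse( '<doc><p>fine</p></doc>') or die $@;
my $p= $good->root->first_child( 'p');
$p->set_att( '#seen' => 1);                  # temporary, private value
my $out= $good->root->sprint;
print "$out\n";
```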

METHODS

Twig

A twig is a subclass of XML::Parser, so all XML::Parser
methods can be called on a twig object, including parse
and parsefile. setHandlers on the other hand cannot be
used, see "BUGS"

new This is a class method, the constructor for XML::Twig.
Options are passed as keyword value pairs. Recognized
options are the same as XML::Parser, plus some
XML::Twig specifics:
twig_handlers
This argument replaces the corresponding
XML::Parser argument. It consists of a hash {
expression => handler} where expression is a
generic_attribute_condition, a string_condition, a
regexp_condition, an attribute_condition, a full_path, a
partial_path, a gi, _default_ or _all_.
The idea is to support a useful but efficient
(thus limited) subset of XPATH. A fuller expression
set will be supported in the future, as users
ask for more and as I manage to implement it
efficiently. This will never encompass all of XPATH
due to the streaming nature of parsing (no lookahead
after the element end tag).
A generic_attribute_condition is a condition on an
attribute, in the form *[@att="val"] or *[@att]. Simple
quotes can be used instead of double quotes and the
leading '*' is actually optional. No matter what the gi
of the element is, the handler will be triggered either
if the attribute has the specified value or if it just
exists.
A string_condition is a condition on the content of an
element, in the form gi[string()="foo"]. Simple quotes
can be used instead of double quotes; at the moment you
cannot escape the quotes (this will be added as soon as
I dig out my copy of Mastering Regular Expressions from
its storage box). The text returned is, as per what I
(and Matt Sergeant!) understood from the XPATH spec, the
concatenation of all the text in the element, excluding
all markup. Thus to call a handler on the element
<p>text <b>bold</b></p> the appropriate condition is
p[string()="text bold"]. Note that this is not exactly
conformant to the XPATH spec, it just tries to mimic it
while being still quite concise.
An extension of that notation is
gi[string(child_gi)="foo"] where the handler will be
called if a child of a "gi" element has a text value of
"foo". At the moment only direct children of the "gi"
element are checked. If you need to test on descendants
of the element let me know. The fix is trivial but would
slow down the checks, so I'd like to keep it the way it
is.
A regexp_condition is a condition on the content of an
element, in the form gi[string()=~ /foo/]. This is the
same as a string condition except that the text of the
element is matched to the regexp. The "i", "m", "s" and
"o" modifiers can be used on the regexp.
The gi[string(child_gi)=~ /foo/] extension is also
supported.
An attribute_condition is a simple condition on an
attribute of the current element, in the form
gi[@att="val"] (simple quotes can be used instead of
double quotes, you cannot escape quotes either). If
several attribute_conditions are true for the same
element all the handlers can be called in turn (in the
order in which they were first defined). If the ="val"
part is omitted (the condition is then gi[@att]) then
the handler is triggered if the attribute actually
exists for the element, no matter what its value is.
A full_path looks like '/doc/section/chapter/title', it
starts with a / then gives all the gi's to the element.
The handler will be called if the path to the current
element (in the input document) is exactly as defined by
the full_path.
A partial_path is like a full_path except it does not
start with a /: 'chapter/title' for example. The handler
will be called if the path to the element (in the input
document) ends as defined in the partial_path.
WARNING: (hopefully temporary) at the moment
string_condition, regexp_condition and
attribute_condition are only supported on a simple gi,
not on a path.
A gi (generic identifier) is just a tag name.
#CDATA and #ENT can be used to call a handler for
a CDATA section or an entity respectively.
A special gi _all_ is used to call a function for
each element. The special gi _default_ is used to call a
handler for each element that does NOT have a specific
handler.
The order of precedence to trigger a handler is:
generic_attribute_condition, string_condition,
regexp_condition, attribute_condition, full_path, longer
partial_path, shorter partial_path, gi, _default_.
Important: once a handler has been triggered, if it
returns 0 then no other handler is called, except an
_all_ handler which will be called anyway.
If a handler returns a true value and other handlers
apply, then the next applicable handler will be called.
Repeat, rinse, lather...
When an element is CLOSED the corresponding handler is
called, with 2 arguments: the twig and the element. The
twig includes the document tree that has been built so
far, the element is the complete sub-tree for the
element. $_ is also set to the element.
Text is stored in elements where gi is #PCDATA
(due to mixed content, text and sub-element in an
element there is no way to store the text as just
an attribute of the enclosing element).
Warning: if you have used purge or flush on the
twig the element might not be complete, some of
its children might have been entirely flushed or
purged, and the start tag might even have been
printed (by flush) already, so changing its gi
might not give the expected result.
More generally, the full_path, partial_path and gi
expressions are evaluated against the input document.
Which means that even if you have changed the gi of an
element (changing the gi of a parent element from a
handler for example) the change will not impact the
expression evaluation.
Attributes in attribute_condition are different though.
As the initial value of the attribute is not stored, the
handler will be triggered if the current attribute/value
pair is found when the element end tag is found.
Although this can be quite confusing it should not
impact most users, and allows others to play clever
tricks with temporary attributes. Let me know if this is
a problem for you.
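As an illustration of the expressions described above,
the following sketch registers one partial_path handler
and one attribute_condition handler. The document, the
"id" and "level" attributes and the variable names are
assumptions made for the example. Both handlers return a
true value so that any other applicable handler would
still be called.

```perl
use strict;
use warnings;
use XML::Twig;

my( @titles, @level1);

my $t= XML::Twig->new(
         twig_handlers =>
           { # partial_path: any title whose parent is a section
             'section/title'       => sub { push @titles, $_->text; 1; },
             # attribute_condition: sections with level="1"
             'section[@level="1"]' => sub { push @level1, $_->att( 'id'); 1; },
           },
                     );
$t->parse( '<doc>'
         . '<section level="1" id="s1"><title>One</title></section>'
         . '<section level="2" id="s2"><title>Two</title></section>'
         . '</doc>');

print "titles: @titles\n";
print "level 1 sections: @level1\n";
```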
twig_roots
This argument lets you build the tree only for
those elements you are interested in.

Example:
  my $t= XML::Twig->new( twig_roots => { title => 1, subtitle => 1});
  $t->parsefile( $file);
  my $t= XML::Twig->new( twig_roots => { 'section/title' => 1});
  $t->parsefile( $file);
returns a twig containing a document including
only title and subtitle elements, as children of
the root element.
You can use generic_attribute_condition,
attribute_condition, full_path, partial_path, gi,
_default_ and _all_ to trigger the building of the twig.
string_condition and regexp_condition cannot be used, as
the content of the element, and the string, have not yet
been parsed when the condition is checked.
WARNING: paths are checked for the document. Even if the
twig_roots option is used they will be checked against
the full document tree, not the virtual tree created by
XML::Twig.
WARNING: twig_roots elements should NOT be nested, that
would hopelessly confuse XML::Twig ;--(
Note: you can set handlers (twig_handlers) using
twig_roots.
Example:
  my $t= XML::Twig->new( twig_roots =>
                           { title    => sub { $_[1]->print; },
                             subtitle => \&process_subtitle,
                           }
                       );
  $t->parsefile( $file);
twig_print_outside_roots
To be used in conjunction with the twig_roots
argument. When set to a true value this will print
the document outside of the twig_roots elements.

Example:
  my $t= XML::Twig->new( twig_roots               => { title => \&number_title },
                         twig_print_outside_roots => 1,
                       );
  $t->parsefile( $file);
  { my $nb;
    sub number_title
      { my( $twig, $title)= @_;
        $nb++;
        $title->prefix( "$nb ");
        $title->print;
      }
  }
This example prints the document outside of the
title element, calls number_title for each title
element, prints it, and then resumes printing the
document. The twig is built only for the title
elements.
If the value is a reference to a file handle then
the document outside the twig_roots elements will
be output to this file handle:

  open( OUT, ">out_file") or die "cannot open out file out_file:$!";
  my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                         # default output to OUT
                         twig_print_outside_roots => \*OUT,
                       );
  { my $nb;
    sub number_title
      { my( $twig, $title)= @_;
        $nb++;
        $title->prefix( "$nb ");
        $title->print( \*OUT);    # you have to print to \*OUT here
      }
  }
start_tag_handlers
A hash { expression => handler}. Sets element
handlers that are called when the element is open
(at the end of the XML::Parser Start handler). The
handlers are called with 2 params: the twig and
the element. The element is empty at that point,
its attributes are created though.
You can use generic_attribute_condition,
attribute_condition, full_path, partial_path, gi,
_default_ and _all_ to trigger the handler.
string_condition and regexp_condition cannot be used, as
the content of the element, and the string, have not yet
been parsed when the condition is checked.
The main uses for those handlers are to change the
tag name (you might have to do it as soon as you
find the open tag if you plan to "flush" the twig
at some point in the element), and to create temporary
attributes that will be used when processing
sub-elements with twig_handlers.
You should also use it to change tags if you use
flush. If you change the tag in a regular twig_handler
then the start tag might already have been flushed.
Note: StartTag handlers can be called outside of
twig_roots if this argument is used, in this case
handlers are called with the following arguments:
$t (the twig), $gi (the gi of the element) and
%att (a hash of the attributes of the element).
If the twig_print_outside_roots argument is also used
then the start tag will be printed if the last handler
called returns a "true" value, if it does not then the
start tag will not be printed (so you can print a
modified string yourself for example).
Note that you can use the ignore method in
start_tag_handlers (and only there).
end_tag_handlers
A hash { expression => handler}. Sets element
handlers that are called when the element is
closed (at the end of the XML::Parser End handler).
The handlers are called with 2 params: the
twig and the gi of the element.
twig_handlers are called when an element is completely
parsed, so why have this redundant option?
There is only one use for end_tag_handlers: when
using the twig_roots option, to trigger a handler
for an element outside the roots. It is for example
very useful to number titles in a document
using nested sections:

  my @no= (0);
  my $no;
  my $t= XML::Twig->new(
          start_tag_handlers =>
            { section => sub { $no[$#no]++; $no= join '.', @no; push @no, 0; } },
          twig_roots         =>
            { title => sub { $_[1]->prefix( $no); $_[1]->print; } },
          end_tag_handlers   => { section => sub { pop @no; } },
          twig_print_outside_roots => 1
                      );
  $t->parsefile( $file);
Using the end_tag_handlers argument without
twig_roots will result in an error.
ignore_elts
This option lets you ignore elements when building
the twig. This is useful in cases where you cannot
use twig_roots to ignore elements, for example if
the element to ignore is a sibling of elements you
are interested in.
Example:

my $twig= XML::Twig->new( ignore_elts => { elt
=> 1 });
$twig->parsefile( 'doc.xml');
This will build the complete twig for the document,
except that all "elt" elements (and their
children) will be left out.
char_handler
A reference to a subroutine that will be called
every time PCDATA is found.
keep_encoding
This is a (slightly?) evil option: if the XML document
is not UTF-8 encoded and you want to keep it
that way, then setting keep_encoding will use the
Expat original_string method for character data, thus
keeping the original encoding, as well as the
original entities in the strings.
See the t/test6.t test file to see what results
you can expect from the various encoding options.
WARNING: if the original encoding is multi-byte
then attribute parsing will be EXTREMELY unsafe
under any Perl before 5.6, as it uses regular
expressions which do not deal properly with multi-byte
characters. You can specify an alternate
function to parse the start tags with the
parse_start_tag option (see below).
WARNING: this option is NOT used when parsing with the
non-blocking parser (parse_start, parse_more,
parse_done methods) which you probably should not
use with XML::Twig anyway as they are totally
untested!
output_encoding
This option generates an output_filter using
Text::Iconv or Unicode::Map8 and Unicode::String,
and sets the encoding in the XML declaration. This
is the easiest way to deal with encodings, if you
need more sophisticated features, look at
output_filter below.
output_filter
This option is used to convert the character
encoding of the output document. It is passed
either a string corresponding to a predefined filter
or a subroutine reference. The filter will be
called every time a document or element is processed
by the "print" functions ("print",
"sprint", "flush").
Pre-defined filters are:
latin1
uses either Text::Iconv or Unicode::Map8 and
Unicode::String or a regexp (which works only
with XML::Parser 2.27), in this order, to convert
all characters to ISO-8859-1 (aka latin1)
html
does the same conversion as latin1, plus
encodes entities using HTML::Entities (you
need to have HTML::Entities installed for it to
be available). This should only be used if the
tags and attribute names themselves are in
US-ASCII, or they will be converted and the
output will not be valid XML any more
safe
converts the output to ASCII (US) only plus
character entities (&#nnn;). This should be
used only if the tags and attribute names
themselves are in US-ASCII, or they will be
converted and the output will not be valid XML
any more
iconv_convert ($encoding)
this function is used to create a filter subroutine
that will be used to convert the characters
to the target encoding using
Text::Iconv (which needs to be installed, look
at the documentation for the module and for
the "iconv" library to find out which encodings
are available on your system)

  my $conv = XML::Twig::iconv_convert( 'latin1');
  my $t = XML::Twig->new(output_filter => $conv);
unicode_convert ($encoding)
this function is used to create a filter subroutine
that will be used to convert the characters
to the target encoding using
Unicode::String and Unicode::Map8 (which need to
be installed, look at the documentation for
the modules to find out which encodings are
available on your system)

  my $conv = XML::Twig::unicode_convert( 'latin1');
  my $t = XML::Twig->new(output_filter => $conv);
Note that the "text" and "att" methods do not use
the filter, so their results are always in unicode.
input_filter
This option is similar to output_filter except the
filter is applied to the characters before they
are stored in the twig, at parsing time.
parse_start_tag
If you use the keep_encoding option then this
option can be used to replace the default parsing
function. You should provide a coderef (a reference
to a subroutine) as the argument, this subroutine
takes the original tag (given by the
XML::Parser::Expat original_string() method) and
returns a gi and the attributes in a hash (or in a
list attribute_name/attribute value).
expand_external_ents
When this option is used external entities (that
are defined) are expanded when the document is
output using "print" functions such as "print",
"sprint", "flush" and "xml_string". Note that in
the twig the entity will be stored as an element
with a gi '#ENT', the entity will not be expanded
there, so you might want to process the entities
before outputting it.
load_DTD
If this argument is set to a true value, parse or
parsefile on the twig will load the DTD information.
This information can then be accessed
through the twig, in a DTD_handler for example.
This will load even an external DTD.
Note that to do this the module will generate a
temporary file in the current directory. If this
is a problem let me know and I will add an option
to specify an alternate directory.
See DTD Handling for more information.
DTD_handler
Sets a handler that will be called once the doctype
(and the DTD) have been loaded, with 2 arguments,
the twig and the DTD.
id This optional argument gives the name of an
attribute that can be used as an ID in the document.
Elements whose ID is known can be accessed
through the elt_id method. id defaults to 'id'.
See "BUGS"
discard_spaces
If this optional argument is set to a true value
then spaces are discarded when they look
non-significant: strings containing only spaces are
discarded. This argument is set to true by default.
keep_spaces
If this optional argument is set to a true value
then all spaces in the document are kept, and
stored as PCDATA. keep_spaces and discard_spaces
cannot be both set.
discard_spaces_in
This argument sets keep_spaces to true but will
cause the twig builder to discard spaces in the
elements listed. The syntax for using this argument
is:
  XML::Twig->new( discard_spaces_in => [ 'elt1', 'elt2']);
keep_spaces_in
This argument sets discard_spaces to true but will
cause the twig builder to keep spaces in the
elements listed. The syntax for using this argument
is:
  XML::Twig->new( keep_spaces_in => [ 'elt1', 'elt2']);
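The effect of the space options can be checked with a
small sketch like the one below, which parses the same
(made up) document with the default settings and with
keep_spaces, then compares the serialized results. It
assumes "sprint" is available to return the output as a
string.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml= '<d> <e>text</e> </d>';             # spaces around the e element

# default behaviour: strings containing only spaces are discarded
my $default= XML::Twig->new();
$default->parse( $xml);
my $discarded= $default->root->sprint;

# keep_spaces: all spaces are kept, stored as PCDATA
my $keeper= XML::Twig->new( keep_spaces => 1);
$keeper->parse( $xml);
my $kept= $keeper->root->sprint;

print "default:     $discarded\n";
print "keep_spaces: $kept\n";
```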
PrettyPrint
Sets the pretty print method, amongst 'none'
(default), 'nsgmls', 'nice', 'indented',
'indented_c', 'record' and 'record_c'
none
The document is output as one long string,
with no linebreaks except those found within
text elements
nsgmls
Line breaks are inserted in safe places: that
is within tags, between a tag and an
attribute, between attributes and before the >
at the end of a tag.
This is quite ugly but better than "none", and
it is very safe, the document will still be
valid (conforming to its DTD).
This is how the SGML parser "sgmls" splits
documents, hence the name.
nice
This option inserts line breaks before any tag
that does not contain text (so elements with
textual content are not broken, as the \n is
then significant).
WARNING: this option leaves the document well-formed
but might make it invalid (not conformant to its
DTD). If you have elements declared as

<!ELEMENT foo (#PCDATA|bar)>
then a "foo" element including a "bar" one
will be printed as

<foo>
<bar>bar is just pcdata</bar>
</foo>
This is invalid, as the parser will take the
line break after the foo tag as a sign that
the element contains PCDATA, it will then die
when it finds the "bar" tag. This may or may
not be important for you, but be aware of it!
indented
Same as "nice" (and with the same warning) but
indents elements according to their level
indented_c
Same as "indented" but a little more compact:
the closing tags are on the same line as the
preceding text
record
This is a record-oriented pretty print, that
displays data in records, one field per line
(which looks a LOT like "indented")
record_c
Stands for record compact, one record per line
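A quick way to see what a pretty print style does is to
serialize a small document with it. The sketch below uses
the 'indented' style through the perlish "pretty_print"
spelling of the option; the sample document is an
assumption, and "sprint" is used to get the result as a
string.

```perl
use strict;
use warnings;
use XML::Twig;

my $t= XML::Twig->new( pretty_print => 'indented');
$t->parse( '<doc><section><title>T</title><p>text</p></section></doc>');

my $out= $t->sprint;     # elements come out one per line, indented by level
print $out;
```

Elements with textual content (title and p here) stay on
a single line, while doc and section are broken across
lines and their children are indented.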
EmptyTags
Sets the empty tag display style (normal, html or
expand).
comments
Sets the way comments are processed: drop
(default), keep or process
drop
drops the comments, they are not read, nor
printed to the output
keep
comments are loaded and will appear on the
output, they are not accessible within the
twig and will not interfere with processing
though
Bug: comments in the middle of a text element
such as

<p>text <!-- comment --> more text</p>
are output at the end of the text:

<p>text more text <!-- comment --></p>
process
comments are loaded in the twig and will be
treated as regular elements (their "gi" is
"#COMMENT") this can interfere with processing
if you expect "$elt->{first_child}" to be an
element but find a comment there. Validation
will not protect you from this as comments can
happen anywhere. You can use
"$elt->first_child( 'gi')" (which is a good
habit anyway) to get where you want. Consider
using
pi Sets the way processing instructions are processed:
"drop", "keep" (default) or "process".
Note that you can also set PI handlers in the
twig_handlers option:

'?' => handler
'?target' => handler 2
The handlers will be called with 2 parameters, the
twig and the PI element if pi is set to "process",
and with 3, the twig, the target and the data if
pi is set to "keep". Of course they will not be
called if pi is set to "drop".
If pi is set to "keep" the handler should return a
string that will be used as-is as the PI text (it
should look like "<?target data?>", or '' if
you want to remove the PI).
Only one handler will be called, "?target" or "?"
if no specific handler for that target is available.
Note: I _HATE_ the Java-like name of arguments used by
most XML modules. As XML::Twig is based on
XML::Parser I kept the style, but you can also use a
more perlish naming convention, using
"twig_print_outside_roots" instead of
"TwigPrintOutsideRoots" or "pretty_print" instead of
"PrettyPrint", XML::Twig then normalizes all the
argument names.
parse(SOURCE [, OPT => OPT_VALUE [...]])
This method is inherited from XML::Parser. The SOURCE
parameter should either be a string containing the
whole XML document, or it should be an open IO::Handle.
Constructor options to XML::Parser::Expat given
as keyword-value pairs may follow the SOURCE parameter.
These override, for this call, any options or
attributes passed through from the XML::Parser
instance.
A die call is thrown if a parse error occurs.
Otherwise it will return the twig built by the parse. Use
safe_parse if you want the parsing to return even when
an error occurs.
parsestring
This is just an alias for parse for backwards
compatibility.
parsefile(FILE [, OPT => OPT_VALUE [...]])
This method is inherited from XML::Parser.
Open FILE for reading, then call parse with the open
handle. The file is closed no matter how parse
returns.
A die call is thrown if a parse error occurs.
Otherwise it will return the twig built by the parse. Use
safe_parsefile if you want the parsing to return even
when an error occurs.
parseurl $url $optional_user_agent
Gets the data from the url and parses it. Note that the
data is piped to the parser in chunks the size of the
XML::Parser::Expat buffer, so memory consumption and
hopefully speed are optimal.
If the $optional_user_agent argument is used then it
is used, otherwise a new one is created.
safe_parse( SOURCE [, OPT => OPT_VALUE [...]])
This method is similar to parse except that it wraps
the parsing in an eval block. It returns the twig on
success and 0 on failure (the twig object also
contains the parsed twig). $@ contains the error message
on failure.
Note that the parsing still stops as soon as an error
is detected, there is no way to keep going after an
error.
safe_parsefile(FILE [, OPT => OPT_VALUE [...]])
This method is similar to parsefile except that it wraps
the parsing in an eval block. It returns the
twig on success and 0 on failure (the twig object also
contains the parsed twig). $@ contains the error
message on failure.
Note that the parsing still stops as soon as an error
is detected, there is no way to keep going after an
error.
safe_parseurl $url $optional_user_agent
Same as parseurl except that it wraps the parsing in an
eval block. It returns the twig on success and 0 on
failure (the twig object also contains the parsed
twig). $@ contains the error message on failure.
parser
This method returns the expat object (actually the
XML::Parser::Expat object) used during parsing. It is
useful for example to call XML::Parser::Expat methods
on it. To get the line of a tag for example use
$t->parser->current_line.
setTwigHandlers ($handlers)
Set the Twig handlers. $handlers is a reference to a
hash similar to the one in the TwigHandlers option of
new. All previous handlers are unset. The method
returns the reference to the previous handlers.
setTwigHandler ($gi $handler)
Set a single Twig handlers for the $gi element. $han
dler is a reference to a subroutine. If the handler
was previously set then the reference to the previous
handler is returned.
setStartTagHandlers ($handlers)
Set the StartTag handlers. $handlers is a reference to
a hash similar to the one in the start_tag_handlers
option of new. All previous handlers are unset. The
method returns the reference to the previous handlers.
setStartTagHandler ($gi, $handler)
Set a single StartTag handler for the $gi element.
$handler is a reference to a subroutine. If the
handler was previously set then the reference to the
previous handler is returned.
setEndTagHandlers ($handlers)
Set the EndTag handlers. $handlers is a reference to a
hash similar to the one in the end_tag_handlers option
of new. All previous handlers are unset. The method
returns the reference to the previous handlers.
setEndTagHandler ($gi, $handler)
Set a single EndTag handler for the $gi element.
$handler is a reference to a subroutine. If the
handler was previously set then the reference to the
previous handler is returned.
dtd Returns the dtd (an XML::Twig::DTD object) of a twig
root
Returns the root element of a twig
set_root ($elt)
Sets the root of a twig
first_elt ($optional_gi)
Returns the first element of a twig whose gi is
$optional_gi; if no $optional_gi is given then the root
is returned
elt_id ($id)
Returns the element whose id attribute is $id
encoding
This method returns the encoding of the XML document,
as defined by the encoding attribute in the XML
declaration (i.e. it is "undef" if the attribute is not
defined)
set_encoding
This method sets the value of the encoding attribute
in the XML declaration. Note that if the document did
not have a declaration it is generated (with an XML
version of 1.0)
xml_version
This method returns the XML version, as defined by the
version attribute in the XML declaration (ie it is
"undef" if the attribute is not defined)
set_xml_version
This method sets the value of the version attribute in
the XML declaration. If the declaration did not exist
it is created.
standalone
This method returns the value of the standalone
declaration for the document
set_standalone
This method sets the value of the standalone attribute
in the XML declaration. Note that if the document did
not have a declaration it is generated (with an XML
version of 1.0)
entity_list
Returns the entity list of a twig
change_gi ($old_gi, $new_gi)
Performs a (very fast) global change. All elements
old_gi are now new_gi. See "BUGS"
flush ($optional_filehandle, $options)
Flushes a twig up to (and including) the current
element, then deletes all unnecessary elements from the
tree that's kept in memory. flush keeps track of
which elements need to be open/closed, so if you flush
from handlers you don't have to worry about anything.
Just keep flushing the twig every time you're done
with a sub-tree and it will come out well-formed.
After the whole parsing don't forget to flush one more
time to print the end of the document. The doctype
and entity declarations are also printed.
flush takes an optional filehandle as an argument.
options: use the Update_DTD option if you have updated
the (internal) DTD and/or the entity list and you want
the updated DTD to be output
The PrettyPrint option sets the pretty printing of the
document.

Example: $t->flush( Update_DTD => 1);
$t->flush( \*FILE, Update_DTD => 1);
$t->flush( \*FILE);
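As a rough illustration of flushing from a handler, the sketch below flushes after every section element and once more after the parse. The input string, the section element name and the in-memory filehandle are assumptions made for the example; a real script would more likely use parsefile and an ordinary file.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input; a real document would normally come from parsefile.
my $xml = '<doc><section>a</section><section>b</section></doc>';

# Collect the flushed output in a string through an in-memory filehandle.
my $output = '';
open( my $out, '>', \$output ) or die "cannot open in-memory handle: $!";

my $flushed = 0;    # number of times the handler flushed
my $twig = XML::Twig->new(
    twig_handlers => {
        # $_[0] is the twig, $_[1] the section that was just closed
        section => sub { $_[0]->flush( $out ); $flushed++; },
    },
);
$twig->parse( $xml );
$twig->flush( $out );    # one last flush prints the end of the document
close $out;

print $output, "\n";
```

Each call only outputs what has not been flushed yet, so the pieces written to the filehandle add up to one well-formed document.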
flush_up_to ($elt, $optional_filehandle, %options)
Flushes up to the $elt element. This allows you to
keep part of the tree in memory when you flush.
options: see flush.
purge
Does the same as a flush except it does not print the
twig. It just deletes all elements that have been
completely parsed so far.
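A small sketch of the usual purge pattern: extract what is needed from each element, then purge so that memory use stays flat. The orders document and the amount attribute are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input document.
my $xml = '<orders><order amount="10"/><order amount="32"/></orders>';

my $total = 0;
my $twig  = XML::Twig->new(
    twig_handlers => {
        order => sub {
            my ( $t, $order ) = @_;
            $total += $order->att( 'amount' );
            $t->purge;    # free what has been parsed so far, print nothing
        },
    },
);
$twig->parse( $xml );

print "total: $total\n";
```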
purge_up_to ($elt)
Purges up to the $elt element. This allows you to keep
part of the tree in memory when you purge.
print ($optional_filehandle, %options)
Prints the whole document associated with the twig. To
be used only AFTER the parse.
options: see flush.
sprint
Returns the text of the whole document associated with
the twig. To be used only AFTER the parse.
options: see flush.
ignore
This method can only be called in start_tag_handlers.
It causes the element to be skipped during the
parsing: the twig is not built for this element, it will
not be accessible during parsing or after it. The
element will not take up any memory and parsing will be
faster.
Note that this method can also be called on an
element. If the element is a parent of the current
element then this element will be ignored (the twig will
not be built any more for it and what has already been
built will be deleted)
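The following sketch shows ignore called from a start tag handler, as described above. The document and the secret element name are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: the secret element should never reach the tree.
my $xml = '<doc><public>yes</public><secret><key>123</key></secret></doc>';

my $twig = XML::Twig->new(
    start_tag_handlers => {
        # called when <secret> opens; the twig is not built for it
        secret => sub { $_[0]->ignore; },
    },
);
$twig->parse( $xml );

my @secrets = $twig->root->descendants( 'secret' );
my @kept    = map { $_->gi } $twig->root->children;

print "kept: @kept\n";
print "secret elements in the tree: ", scalar @secrets, "\n";
```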
set_pretty_print ($style)
Sets the pretty print method, amongst 'none'
(default), 'nsgmls', 'nice', 'indented', 'record' and
'record_c'.
WARNING: the pretty print style is a GLOBAL variable,
so once set it is applied to ALL prints (and
sprints). The same goes if you use XML::Twig with
mod_perl. This should not be a problem as the XML
that's generated is valid anyway, and XML processors
(as well as HTML processors, including browsers)
should not care. Let me know if this is a big problem,
but at the moment the performance/cleanliness tradeoff
clearly favors the global approach.
set_empty_tag_style ($style)
Sets the empty tag display style (normal, html or
expand). As with set_pretty_print this sets a global
flag.
normal outputs an empty tag '<tag/>', html adds a
space '<tag />' and expand outputs '<tag></tag>'
print_prolog ($optional_filehandle, %options)
Prints the prolog (XML declaration + DTD + entity
declarations) of a document.
options: see flush.
prolog ($optional_filehandle, %options)
Returns the prolog (XML declaration + DTD + entity
declarations) of a document.
options: see flush.
finish
Call Expat finish method. Unsets all handlers
(including internal ones that set context), but expat
continues parsing to the end of the document or until
it finds an error. It should finish up a lot faster
than with the handlers set.
finish_print
Stop twig processing, flush the twig and proceed to
finish printing the document as fast as possible. Use
this method when modifying a document and the
modification is done.
Methods inherited from XML::Parser::Expat
A twig inherits all the relevant methods from
XML::Parser::Expat. These methods can only be used
during the parsing phase (they will generate a fatal
error otherwise).
Inherited methods are:

depth in_element within_element context
current_line current_column current_byte position_in_context
base current_element element_index
namespace eq_name generate_ns_name new_ns_prefixes
expand_ns_prefix current_ns_prefixes
recognized_string original_string
xpcroak xpcarp
path($gi)
Returns the element context in a form similar to
XPath's short form: '/root/gi1/../gi'
get_xpath ( $optional_array_ref, $xpath, $optional_offset)
Performs a get_xpath on the document root (see
"Elt").
If the $optional_array_ref argument is used the array
must contain elements. The $xpath expression is
applied to each element in turn and the result is the
union of all results. This way a first query can be
refined in further steps.
find_nodes
same as get_xpath
dispose
Useful only if you don't have WeakRef installed.
Reclaims properly the memory used by an XML::Twig
object. As the object has circular references it never
goes out of scope, so if you want to parse lots of XML
documents then the memory leak becomes a problem. Use
$twig->dispose to clear this problem.
Elt
print ($optional_filehandle, $optional_pretty_print_style)
Prints an entire element, including the tags,
optionally to a $optional_filehandle, optionally with a
$optional_pretty_print_style.
The print outputs XML data so base entities are
escaped.
sprint ($elt, $optional_no_enclosing_tag)
Returns the xml string for an entire element, including
the tags. If the optional second argument is true
then only the string inside the element is returned
(the start and end tag for $elt are not). The text is
XML-escaped: base entities (& and < in text, & < and "
in attribute values) are turned into entities.
gi Returns the gi of the element (the gi is the "generic
identifier" the tag name in SGML parlance).
tag Same as gi
set_gi ($gi)
Sets the gi (tag) of an element
set_tag ($gi)
Sets the tag (=gi) of an element
root
Returns the root of the twig in which the element is
contained.
twig
Returns the twig containing the element.
parent ($optional_cond)
Returns the parent of the element, or the first
ancestor matching the cond
first_child ($optional_cond)
Returns the first child of the element, or the first
child matching the cond
first_child_text ($optional_cond)
Returns the text of the first child of the element, or
of the first child matching the cond. If there is no
first_child then returns ''. This avoids getting the
child, checking for its existence then getting the text
for trivial cases.
Similar methods are available for the other navigation
methods: "last_child_text", "prev_sibling_text",
"next_sibling_text", "prev_elt_text", "next_elt_text",
"child_text", "parent_text"
field ($optional_cond)
Same method as first_child_text with a different name
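A short sketch of first_child_text and field on a made-up book record. The element names are assumptions for the example; the point is that a missing child yields an empty string rather than an error.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<book><title>Perl</title><author>Larry</author></book>' );
my $book = $twig->root;

my $title  = $book->first_child_text( 'title' );
my $author = $book->field( 'author' );             # same method, other name
my $isbn   = $book->first_child_text( 'isbn' );    # no such child

print "title: $title, author: $author, isbn: '$isbn'\n";
```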
first_child_matches ($optional_cond)
Returns the element if the first child of the element
(if it exists) passes the $cond, "undef" otherwise

if( $elt->first_child_matches( 'title')) ...
is equivalent to

if( $elt->{first_child} &&
$elt->{first_child}->passes( 'title'))
"first_child_is" is another name for this method
Similar methods are available for the other navigation
methods: "last_child_matches", "prev_sibling_matches",
"next_sibling_matches", "prev_elt_matches",
"next_elt_matches", "child_matches", "parent_matches"
prev_sibling ($optional_cond)
Returns the previous sibling of the element, or the
previous sibling matching cond
next_sibling ($optional_cond)
Returns the next sibling of the element, or the first
one matching cond.
next_elt ($optional_elt, $optional_cond)
Returns the next elt (optionally matching cond) of the
element. This is defined as the next element which
opens after the current element opens, which usually
means the first child of the element. Counter-intuitive
as it might look, this allows you to loop through
the whole document by starting from the root.
The $optional_elt is the root of a subtree. When the
next_elt is out of the subtree then the method returns
undef. You can then walk a sub tree with:

my $elt= $subtree_root;
while( $elt= $elt->next_elt( $subtree_root))
{ # insert processing code here
}
prev_elt ($optional_cond)
Returns the previous elt (optionally matching cond) of
the element. This is the first element which opens
before the current one. It is usually either the last
descendant of the previous sibling or simply the
parent
children ($optional_cond)
Returns the list of children (optionally those matching
cond) of the element. The list is in document order.
descendants ($optional_cond)
Returns the list of all descendants (optionally those
matching cond) of the element. This is the equivalent
of the getElementsByTagName of the DOM (by the way, if
you are really a DOM addict, you can use
"getElementsByTagName" instead)
descendants_or_self ($optional_cond)
Same as descendants except that the element itself is
included in the list if it matches the $optional_cond
ancestors ($optional_cond)
Returns the list of ancestors (optionally matching
cond) of the element. The list is ordered from the
innermost ancestor to the outermost one.
NOTE: the element itself is not part of the list; in
order to include it you will have to write:

my @array= ($elt, $elt->ancestors)
att ($att)
Returns the attribute value or "undef"
set_att ($att, $att_value)
Sets the attribute of the element to the given value
You can actually set several attributes this way:

$elt->set_att( att1 => "val1", att2 => "val2");
del_att ($att)
Delete the attribute for the element
You can actually delete several attributes at once:

$elt->del_att( 'att1', 'att2', 'att3');
cut Cuts the element from the tree. The element still
exists, it can be copied or pasted somewhere else, it
is just not attached to the tree anymore.
copy ($elt)
Returns a copy of the element. The copy is a "deep"
copy: all sub elements of the element are duplicated.
paste ($optional_position, $ref)
Pastes a (previously cut or newly generated) element.
Dies if the element already belongs to a tree.
The optional position element can be:
first_child (default)
The element is pasted as the first child of the
element object this method is called on.
last_child
The element is pasted as the last child of the
element object this method is called on.
before
The element is pasted before the element object,
as its previous sibling.
after
The element is pasted after the element object, as
its next sibling.
within
In this case an extra argument, $offset, should be
supplied. The element will be pasted in the
reference element (or in its first text child) at the
given offset. To achieve this the reference
element will be split at the offset.
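A minimal sketch of cut and paste with two of the positions listed above. The four element names are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><a/><b/><c/></doc>' );
my $root = $twig->root;

# Move <c/> to the front: cut it, then paste it as the first child.
my $c = $root->first_child( 'c' );
$c->cut;
$c->paste( first_child => $root );

# Create a new element and paste it right after <a/>.
my $d = XML::Twig::Elt->new( 'd' );
$d->paste( after => $root->first_child( 'a' ) );

my $order = join ' ', map { $_->gi } $root->children;
print "$order\n";
```

The element must be cut (or newly created) before it is pasted, since paste dies on an element that still belongs to a tree.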
move ($optional_position, $ref)
Move an element in the tree. This is just a cut then
a paste. The syntax is the same as paste.
replace ($ref)
Replaces an element in the tree. Sometimes it is just
not possible to cut an element then paste another in
its place, so replace comes in handy.
delete
Cuts the element and frees the memory.
prefix ($text, $optional_option)
Adds a prefix to an element. If the element is a PCDATA
element the text is added to the pcdata, if the
element's first_child is a PCDATA then the text is added
to its pcdata, otherwise a new PCDATA element is
created and pasted as the first child of the element.
If the option is "asis" then the prefix is added asis:
it is created in a separate PCDATA element with an
asis property. You can then write:

$elt1->prefix( '<b>', 'asis');
to create a <b> in the output of "print".
suffix ($text, $optional_option)
Adds a suffix to an element. If the element is a PCDATA
element the text is added to the pcdata, if the
element's last_child is a PCDATA then the text is added
to its pcdata, otherwise a new PCDATA element is created
and pasted as the last child of the element.
If the option is "asis" then the suffix is added asis:
it is created in a separate PCDATA element with an
asis property. You can then write:

$elt2->suffix( '<b>', 'asis');
split_at ($offset)
Split a text ("PCDATA" or "CDATA") element in 2 at
$offset, the original element now holds the first part
of the string and a new element holds the right part.
The new element is returned
If the element is not a text element then the first
text child of the element is split
split ( $optional_regexp, $optional_tag,
$optional_attribute_ref)
Splits the text descendants of an element in place. The
text is split using the regexp; if the regexp includes
() then the matched separators are wrapped in
$optional_tag, with $optional_attribute_ref attributes.
If $elt is "<p>tati tata <b>tutu tati titi</b> tata
tati tata</p>"

$elt->split( qr/(ta)ti/, 'foo', {type => 'toto'} )
will change $elt to

<p><foo type="toto">ta</foo> tata <b>tutu <foo type="toto">ta</foo>
titi</b> tata <foo type="toto">ta</foo> tata</p>
The regexp can be passed either as a string or as qr//
(perl 5.005 and later); it defaults to \s+, just as the
"split" built-in (but this would be quite a useless
behaviour without the $optional_tag parameter).
$optional_tag defaults to PCDATA or CDATA, depending
on the initial element type.
The list of descendants is returned (including
untouched original elements and newly created ones).
mark ( $regexp, $optional_tag,
$optional_attribute_ref)
This method behaves exactly as split, except only the
newly created elements are returned
new ($optional_gi, $optional_atts, @optional_content)
The gi is optional (but then you can't have any
content), the optional atts is the ref of a hash of
attributes, the content can be just a string or a list
of strings and elements. A content of '#EMPTY' creates
an empty element.

Examples: my $elt= XML::Twig::Elt->new();
my $elt= XML::Twig::Elt->new( 'para', { align => 'center' });
my $elt= XML::Twig::Elt->new( 'para', { align => 'center' }, 'foo');
my $elt= XML::Twig::Elt->new( 'br', '#EMPTY');
my $elt= XML::Twig::Elt->new( 'para');
my $elt= XML::Twig::Elt->new( 'para', 'this is a para');
my $elt= XML::Twig::Elt->new( 'para', $elt3, 'another para');
The strings are not parsed, the element is not
attached to any twig.
WARNING: if you rely on IDs then you will have to set
the id yourself. At this point the element does not
belong to a twig yet, so the ID attribute is not known
and it won't be stored in the ID list.
parse ($string, %args)
Creates an element from an XML string. The string is
actually parsed as a new twig, then the root of that
twig is returned. The arguments in %args are passed
to the twig. As always if the parse fails the parser
will die, so use an eval if you want to trap syntax
errors.
As obviously the element does not exist beforehand
this method has to be called on the class:

my $elt= parse XML::Twig::Elt( "<a> string to parse,
with <sub/>
<elements>, actually
tons of </elements>
h</a>");
get_xpath ($xpath, $optional_offset)
Returns a list of elements satisfying the $xpath.
$xpath is an XPATH-like expression.
A subset of the XPATH abbreviated syntax is covered:

gi
gi[1] (or any other positive number)
gi[last()]
gi[@att] (the attribute exists for the element)
gi[@att="val"]
gi[@att=~ /regexp/]
gi[att1="val1" and att2="val2"]
gi[att1="val1" or att2="val2"]
gi[string()="toto"] (returns gi elements which text
(as per the text method)
is toto)
gi[string()=~/regexp/] (returns gi elements which
text (as per the text
method) matches regexp)
expressions can start with / (search starts at the
document root)
expressions can start with . (search starts at the
current element)
// can be used to get all descendants instead of
just direct children
* matches any gi
So the following examples from the XPATH recommendation
(http://www.w3.org/TR/xpath.html#path-abbrev)
work:

para selects the para element children of the context
node
* selects all element children of the context node
para[1] selects the first para child of the context
node
para[last()] selects the last para child of the context
node
*/para selects all para grandchildren of the context
node
/doc/chapter[5]/section[2] selects the second section
of the fifth chapter of the doc
chapter//para selects the para element descendants
of the chapter element children of the context node
//para selects all the para descendants of the document
root and thus selects all para elements in the same
document as the context node
//olist/item selects all the item elements in the
same document as the context node that have an olist
parent
.//para selects the para element descendants of the
context node
.. selects the parent of the context node
para[@type="warning"] selects all para children of
the context node that have a type attribute with value
warning
employee[@secretary and @assistant] selects all the
employee children of the context node that have both a
secretary attribute and an assistant attribute
The elements will be returned in the document order.
If $optional_offset is used then only one element will
be returned, the one with the appropriate offset in
the list, starting at 0
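To make the supported subset more concrete, here is a small sketch that runs three of the expression forms listed above against an invented document. The element names and attribute values are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<doc><para type="warning">w1</para><para>p2</para>'
  . '<sect><para type="warning">w3</para></sect></doc>'
);

# All para elements in the document with type="warning", in document order.
my @warnings = map { $_->text } $twig->get_xpath( '//para[@type="warning"]' );

# Only the para elements that are direct children of the root.
my @direct = $twig->root->get_xpath( 'para' );

# With an offset a single element is returned; offsets start at 0.
my $second = $twig->get_xpath( '//para', 1 );

print "warnings: @warnings\n";
print "direct para children: ", scalar @direct, "\n";
print "second para: ", $second->text, "\n";
```

Single-quoted Perl strings are used for the expressions so that @type is not interpolated as an array.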
Quoting and interpolating variables can be a pain when
the Perl syntax and the XPATH syntax collide, so here
are some more examples to get you started:

my $p1= "p1";
my $p2= "p2";
my @res= $t->get_xpath( "p[string( '$p1') or string( '$p2')]");

my $a= "a1";
my @res= $t->get_xpath( "//*[\@att=\"$a\"]");

my $val= "a1";
my $exp= "//p[ \@att='$val']"; # you need to use \@ or you will get a warning
my @res= $t->get_xpath( $exp);
XML::Twig does not provide full XPATH support. If
that's what you want then look no further than the
XML::XPath module on CPAN.
Note that the only supported regexps delimiters are /
and that you must backslash all / in regexps AND in
regular strings.
find_nodes
same as get_xpath
text
Returns a string consisting of all the PCDATA and
CDATA in an element, without any tags. The text is not
XML-escaped: base entities such as & and < are not
escaped.
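The difference between text and sprint is easiest to see side by side. This sketch uses a made-up paragraph containing an ampersand and an inline element.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><p>fish &amp; <b>chips</b></p></doc>' );
my $p = $twig->root->first_child( 'p' );

my $text  = $p->text;           # no tags, not XML-escaped
my $xml   = $p->sprint;         # tags included, XML-escaped
my $inner = $p->sprint( 1 );    # same, without the enclosing <p> tags

print "text:   $text\n";
print "sprint: $xml\n";
print "inner:  $inner\n";
```

Use text when the value is going to be processed as data, and sprint when the result has to be valid XML again.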
set_text ($string)
Sets the text for the element: if the element is a
PCDATA, just set its text, otherwise cut all the
children of the element and create a single PCDATA child
for it, which holds the text.
insert ($gi1, [$optional_atts1], $gi2,
[$optional_atts2],...)
For each gi in the list inserts an element $gi as the
only child of the element. The element gets the
optional attributes in $optional_attsn. All children
of the element are set as children of the new element.
The upper level element is returned.

$p->insert( table => { border=> 1}, 'tr', 'td')
puts the content of $p in a table with a visible border,
a single tr and a single td and returns the table
element:

<p><table border="1"><tr><td>original content of p</td></tr></table></p>
wrap_in (@gi)
Wraps elements $gi as the successive ancestors of the
element, returns the new element. $elt->wrap_in(
'td', 'tr', 'table') wraps the element as a single
cell in a table for example.
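A small sketch of wrap_in on an invented document, checking where the wrapped element ends up. The cell element name is an assumption for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><cell>x</cell></doc>' );
my $cell = $twig->root->first_child( 'cell' );

# The first gi becomes the parent, the last one the outermost ancestor.
$cell->wrap_in( 'td', 'tr', 'table' );

my $parent    = $cell->parent->gi;                # innermost wrapper
my $outermost = $twig->root->first_child->gi;     # outermost wrapper

print "parent: $parent, outermost: $outermost\n";
```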
insert_new_elt $opt_position, $gi, $opt_atts_hashref,
@opt_content
Combines a "new" and a "paste": creates a new element
using $gi, $opt_atts_hashref and @opt_content which
are arguments similar to those for "new", then paste
it, using $opt_position or 'first_child', relative to
$elt.
Returns the newly created element
erase
Erases the element: the element is deleted and all of
its children are pasted in its place.
set_content ( $optional_atts, @list_of_elt_and_strings)
( $optional_atts, '#EMPTY')
Sets the content for the element, from a list of
strings and elements. Cuts all the element children,
then pastes the list elements as the children. This
method will create a PCDATA element for any strings in
the list.
The optional_atts argument is the ref of a hash of
attributes. If this argument is used then the previous
attributes are deleted, otherwise they are left
untouched.
WARNING: if you rely on IDs then you will have to set
the id yourself. At this point the element does not
belong to a twig yet, so the ID attribute is not known
and it won't be stored in the ID list.
A content of '#EMPTY' creates an empty element;
namespace
Returns the URI of the namespace that the name belongs
to. If the name doesn't belong to any namespace, undef
is returned.
expand_ns_prefix ($prefix)
Returns the uri to which the given prefix is bound in
the context of the element. Returns undef if the
prefix isn't currently bound. Use '#default' to find the
current binding of the default namespace (if any).
current_ns_prefixes
Returns a list of namespace prefixes valid for the
element. The order of the prefixes in the list has no
meaning. If the default namespace is currently bound,
'#default' appears in the list.
inherit_att ($att, @optional_gi_list)
Returns the value of an attribute inherited from
parent tags. The value returned is found by looking for
the attribute in the element then in turn in each of
its ancestors. If the @optional_gi_list is supplied
only those ancestors whose gi is in the list will be
checked.
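A brief sketch of inherit_att, assuming a made-up document in which a lang attribute is set on the root and overridden on one section.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<doc lang="en"><sect><para>x</para></sect>'
  . '<sect lang="fr"><para>y</para></sect></doc>'
);
my @paras = $twig->root->descendants( 'para' );

# Nearest element (self or ancestor) carrying the attribute wins.
my $first  = $paras[0]->inherit_att( 'lang' );
my $second = $paras[1]->inherit_att( 'lang' );

# Restricting the lookup to doc elements skips the value set on sect.
my $from_doc = $paras[1]->inherit_att( 'lang', 'doc' );

print "first: $first, second: $second, from doc only: $from_doc\n";
```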
all_children_are ($cond)
Returns 1 if all children of the element pass the
condition, 0 otherwise
level ($optional_gi)
Returns the depth of the element in the twig (root is
0). If the optional gi is given then only ancestors
of the given type are counted.
WARNING: in a tree created using the twig_roots option
this will not return the level in the document tree:
level 0 will be the document root, level 1 will be the
twig_roots elements. During the parsing (in a
TwigHandler) you can use the depth method on the twig
object to get the real parsing depth.
in ($potential_parent)
Returns true if the element is in the potential_parent
($potential_parent is an element)
in_context ($gi, $optional_level)
Returns true if the element is included in an element
whose gi is $gi, optionally within $optional_level
levels. The returned value is the including element.
pcdata
Returns the text of a PCDATA element or undef if the
element is not PCDATA.
pcdata_xml_string
Returns the text of a PCDATA element or undef if the
element is not PCDATA. The text is "XML-escaped" ('&'
and '<' are replaced by '&amp;' and '&lt;')
set_pcdata ($text)
Sets the text of a PCDATA element.
append_pcdata ($text)
Add the text at the end of a #PCDATA element.
is_cdata
Returns 1 if the element is a #CDATA element, returns
0 otherwise.
is_text
Returns 1 if the element is a #CDATA or #PCDATA
element, returns 0 otherwise.
cdata
Returns the text of a CDATA element or undef if the
element is not CDATA.
set_cdata ($text)
Sets the text of a CDATA element.
append_cdata ($text)
Add the text at the end of a #CDATA element.
remove_cdata
Turns all CDATA sections in the element into regular
PCDATA elements. This is useful when converting XML to
HTML, as browsers do not support CDATA sections.
extra_data
Returns the extra_data (comments and PI's) attached to
an element
set_extra_data
Sets the extra_data (comments and PI's) attached to an
element
append_extra_data
Appends extra_data to the existing extra_data before
the element (if no previous extra_data exists then it
is created)
set_asis
Sets a property of the element that causes it to be
output without being XML escaped by the print
functions: if it contains "a < b" it will be output as
such and not as "a &lt; b". This can be useful to
create text elements that will be output as markup. Note
that all PCDATA descendants of the element are also
marked as having the property (they are the ones
impacted by the change).
If the element is a CDATA element it will also be
output asis, without the CDATA markers. The same goes for
any CDATA descendant of the element
set_not_asis
Unsets the asis property for the element and its text
descendants.
is_asis
Returns the asis property status of the element ( 1 or
"undef")
closed
Returns true if the element has been closed. Might be
useful if you are somewhere in the tree, during the
parse, and have no idea whether a parent element is
completely loaded or not.
get_type
Returns the type of the element: '#ELT' for "real"
elements, or '#PCDATA', '#CDATA', '#COMMENT', '#ENT',
'#PI'
is_elt
Returns the gi if the element is a "real" element, or
0 if it is PCDATA, CDATA...
contains_only_text
Returns 1 if the element does not contain any other
"real" element
is_field
same as contains_only_text
is_pcdata
Returns 1 if the element is a #PCDATA element, returns
0 otherwise.
is_empty
Returns 1 if the element is empty, 0 otherwise
set_empty
Flags the element as empty. No further check is made,
so if the element is actually not empty the output
will be messed up. The only effect of this method is that
the output will be <gi att="value"/>.
set_not_empty
Flags the element as not empty. If it is actually
empty then the element will be output as
<gi att="value"></gi>
child ($offset, $optional_gi)
Returns the $offset-th child of the element,
optionally the $offset-th child with a gi of $optional_gi.
The children are treated as a list, so $elt->child( 0)
is the first child, while $elt->child( -1) is the last
child.
child_text ($offset, $optional_gi)
Returns the text of a child or undef if the child
does not exist. Arguments are the same as child.
last_child ($optional_gi)
Returns the last child of the element, or the last
child whose gi is $optional_gi (ie the last of the
element children whose gi matches).
last_child_text ($optional_gi)
Same as first_child_text but for the last child.
sibling ($offset, $optional_gi)
Returns the next or previous $offset-th sibling of the
element, or the $offset-th one whose gi is
$optional_gi. If $offset is negative then a previous
sibling is returned, if $offset is positive then a
next sibling is returned. $offset=0 returns the
element if there is no $optional_gi or if the element gi
matches $optional_gi, undef otherwise.
sibling_text ($offset, $optional_gi)
Returns the text of a sibling or undef if the sibling
does not exist. Arguments are the same as sibling.
prev_siblings ($optional_gi)
Returns the list of previous siblings (optionally whose
gi is $optional_gi) for the element. The elements are
ordered in document order.
next_siblings ($optional_gi)
Returns the list of siblings (optionally whose gi is
$optional_gi) following the element. The elements are
ordered in document order.
atts
Returns a hash ref containing the element attributes
set_atts ({att1=>$att1_val, att2=> $att2_val... })
Sets the element attributes with the hash ref supplied
as the argument
del_atts
Deletes all the element attributes.
att_names
returns a list of the attribute names for the element
att_xml_string ($att, $optional_quote)
Returns the attribute value, where '&', '<' and $quote
(" by default) are XML-escaped
set_id ($id)
Sets the id attribute of the element to the value.
See "elt_id" to change the id attribute name
id Gets the id attribute value
del_id ($id)
Deletes the id attribute of the element and removes it
from the id list for the document
DESTROY
Frees the element from memory.
start_tag
Returns the string for the start tag for the element,
including the /> at the end of an empty element tag
end_tag
Returns the string for the end tag of an element. For
an empty element, this returns the empty string ('').
xml_string ($elt)
Equivalent to $elt->sprint( 1), returns the string for
the entire element, excluding the element's tags (but
nested element tags are present)
set_pretty_print ($style)
Sets the pretty print method, amongst 'none'
(default), 'nsgmls', 'nice', 'indented', 'record' and
'record_c'
none
the default, no \n is used
nsgmls
nsgmls style, with \n added within tags
nice
adds \n wherever possible (NOT SAFE, can lead to
invalid XML)
indented
same as nice plus indents elements (NOT SAFE, can
lead to invalid XML)
record
table-oriented pretty print, one field per line
record_c
table-oriented pretty print, more compact than
record, one record per line
set_empty_tag_style ($style)
Sets the method to output empty tags, amongst 'normal'
(default), 'html', and 'expand'.
set_indent ($string)
Sets the indentation for the indented pretty print
style (default is 2 spaces)
set_quote ($quote)
Sets the quotes used for attributes. Can be 'double'
(default) or 'single'
cmp ($elt)
Compares the order of the 2 elements in a twig.
$a is the <A>..</A> element, $b is the <B>...</B>
element

document                         $a->cmp( $b)
<A> ... </A> ... <B> ... </B>    -1
<A> ... <B> ... </B> ... </A>    -1
<B> ... </B> ... <A> ... </A>     1
<B> ... <A> ... </A> ... </B>     1
$a == $b                          0
$a and $b not in the same tree   undef
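The cases in the table can be checked with a short sketch. The document below is invented so that C is nested in A and B follows A; the variable names are arbitrary.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><A><C/></A><B/></doc>' );
my $root  = $twig->root;
my $elt_a = $root->first_child( 'A' );
my $elt_b = $root->first_child( 'B' );
my $elt_c = $elt_a->first_child( 'C' );

my %order = (
    a_vs_b => $elt_a->cmp( $elt_b ),    # A closes before B opens
    b_vs_a => $elt_b->cmp( $elt_a ),    # the reverse comparison
    a_vs_c => $elt_a->cmp( $elt_c ),    # C opens inside A, so A comes first
    a_vs_a => $elt_a->cmp( $elt_a ),    # same element
);

print "$_: $order{$_}\n" for sort keys %order;
```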
before ($elt)
Returns 1 if $elt starts before the element, 0 otherwise.
If the 2 elements are not in the same twig then
return undef.

if( $a->cmp( $b) == -1) { return 1; } else { return 0; }
after ($elt)
Returns 1 if $elt starts after the element, 0 otherwise.
If the 2 elements are not in the same twig then
return undef.

if( $a->cmp( $b) == 1) { return 1; } else { return 0; }
path
Returns the element context in a form similar to
XPath's short form: '/root/gi1/../gi'
private methods
set_parent ($parent)
set_first_child ($first_child)
set_last_child ($last_child)
set_prev_sibling ($prev_sibling)
set_next_sibling ($next_sibling)
set_twig_current
del_twig_current
twig_current
flushed
This method should NOT be used, always flush the
twig, not an element.
set_flushed
del_flushed
flush
contains_text
Those methods should not be used, unless of course you
find some creative and interesting, not to mention
useful, ways to do it.
cond
Most of the navigation functions accept a condition as an
optional argument. The first element (or all elements for
"children" or "ancestors") that passes the condition is
returned.
The condition can be
#ELT
return a "real" element (not a PCDATA, CDATA, comment
or pi element)
#TEXT
return a PCDATA or CDATA element
XPath expression
actually a subset of XPath that makes sense in this
context

gi
/regexp/
gi[@att]
gi[@att="val"]
gi[@att=~/regexp/]
gi[text()="blah"]
gi[text(subelt)="blah"]
gi[text()=~ /blah/]
gi[text(subelt)=~ /blah/]
*[@att] (the * is actually optional)
*[@att="val"]
*[@att=~/regexp/]
regular expression
return an element whose gi matches the regexp. The
regexp has to be created with "qr//" (hence this is
available only on perl 5.005 and above)
code reference
applies the code, passing the current element as
argument, if the code returns true then the element is
returned, if it returns false then the code is applied
to the next candidate.
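
The different kinds of condition can be tried out with a
sketch like the one below. The document and its element and
attribute names are invented for the example; first_child and
children are the navigation methods being exercised, each with
one of the condition forms listed above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document, written without whitespace between
# elements so that every child of the root is an element
my $twig = XML::Twig->new;
$twig->parse( '<doc>'
            . '<para id="p1">plain text</para>'
            . '<note type="warning">be careful</note>'
            . '<para id="p2" type="warning">second para</para>'
            . '<list><item>one</item></list>'
            . '</doc>' );
my $root = $twig->root;

# gi: first child whose gi is 'note'
my $by_gi = $root->first_child('note');

# gi[@att="val"]: first para with type="warning"
my $by_att = $root->first_child('para[@type="warning"]');

# regular expression (built with qr//): children whose gi matches
my @by_regexp = $root->children(qr/^(?:para|note)$/);

# code reference: receives each candidate element as its argument
my $by_code = $root->first_child( sub { $_[0]->text =~ /one/ } );

print 'gi:      ', $by_gi->text, "\n";
print 'att:     ', $by_att->att('id'), "\n";
print 'regexp:  ', scalar(@by_regexp), " children\n";
print 'code:    ', $by_code->gi, "\n";
```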
Entity_list
new
Creates an entity list.
add ($ent)
Adds an entity to an entity list.
delete ($ent or $gi)
Deletes an entity (defined by its name or by the
Entity object) from the list.
print ($optional_filehandle)
Prints the entity list.
Entity
new ($name, $val, $sysid, $pubid, $ndata)
Same arguments as the Entity handler for XML::Parser.
print ($optional_filehandle)
Prints an entity declaration.
text
Returns the entity declaration text.

EXAMPLES

See the test files in t/test[1-n].t. Additional examples
(and a complete tutorial) can be found at
http://www.xmltwig.com/

To figure out what flush does, call the following script
with an XML file and an element name as arguments:

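use XML::Twig;

# usage: script file.xml element_name
my ($file, $elt)= @ARGV;
# flush the twig each time an $elt element is closed
my $t= XML::Twig->new( twig_handlers =>
    { $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
$t->parsefile( $file, ErrorContext => 2);
$t->flush;    # flush the end of the document
print "\n";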

NOTES

DTD Handling

There are 3 possibilities here. They are:

No DTD
No doctype, no DTD information, no entity information,
the world is simple...
Internal DTD
The XML document includes an internal DTD, and maybe
entity declarations.
If you use the load_DTD option when creating the twig
the DTD information and the entity declarations can be
accessed.
The DTD and the entity declarations will be flush'ed
(or print'ed) either as is (if they have not been
modified) or as reconstructed (poorly, comments are lost,
order is not kept, due to its content this DTD should
not be viewed by anyone) if they have been modified.
You can also modify them directly by changing the
$twig->{twig_doctype}->{internal} field (straight from
XML::Parser, see the Doctype handler doc)
External DTD
The XML document includes a reference to an external
DTD, and maybe entity declarations.
If you use the load_DTD option when creating the twig the
DTD information and the entity declarations can be
accessed. The entity declarations will be flush'ed (or
print'ed) either as is (if they have not been modified)
or as reconstructed (badly, comments are lost,
order is not kept).
You can change the doctype through the
$twig->set_doctype method and print the dtd through the
$twig->dtd_text or $twig->dtd_print methods.
If you need to modify the entity list this is probably
the easiest way to do it.
Flush
If you set handlers and use flush, do not forget to flush
the twig one last time AFTER the parsing, or you might be
missing the end of the document.
Remember that element handlers are called when the element
is CLOSED, so if you have handlers for nested elements the
inner handlers will be called first. This makes it trickier
than it would seem to, for example, number nested clauses.
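
The calling order can be observed with a sketch such as this
one. The clause element, its id attribute and the @closed
array are assumptions made for the example; the handler simply
records each element as it is closed.

```perl
use strict;
use warnings;
use XML::Twig;

my @closed;    # ids, in the order the handler is called

my $twig = XML::Twig->new(
    twig_handlers => {
        # a handler receives the twig and the element just closed
        clause => sub {
            my ($t, $clause) = @_;
            push @closed, $clause->att('id');
        },
    },
);

# Hypothetical document with one clause nested in another
$twig->parse('<doc><clause id="outer"><clause id="inner"/></clause></doc>');

print join(', ', @closed), "\n";    # prints "inner, outer"
```

Because the inner clause is reported first, a handler that
numbers clauses as it sees them would number the inner clause
before its parent.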

BUGS

entity handling
Due to XML::Parser behaviour, non-base entities in
attribute values disappear: "att="val&ent;"" will be
turned into att => val, unless you use the
"keep_encoding" argument to "XML::Twig->new"
DTD handling
Basically the DTD handling methods are completely
bugged. No one uses them and it seems very difficult
to get them to work in all cases, including with 2
slightly incompatible versions of XML::Parser.
So use XML::Twig with standalone documents, or with
documents referring to an external DTD, but don't
expect it to properly parse and even output back the
DTD.
memory leak
If you use a lot of twigs you might find that you leak
quite a lot of memory (about 2 KB per twig). You can
use the "dispose" method to free that memory after you
are done.
If you create elements the same thing might happen,
use the "delete" method to get rid of them.
Alternatively installing the WeakRef module on a
version of Perl that supports it will get rid of the
memory leaks automagically.
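
A minimal sketch of the "dispose" advice, assuming a script
that builds one twig per input document. The @documents list
and the @texts array are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical inputs: in practice these might be many files
my @documents = ( '<doc>one</doc>', '<doc>two</doc>', '<doc>three</doc>' );
my @texts;

for my $xml (@documents) {
    my $twig = XML::Twig->new;
    $twig->parse($xml);

    # keep only what is needed from the twig...
    push @texts, $twig->root->text;

    # ...then release the memory it holds before the next one
    $twig->dispose;
}

print join(', ', @texts), "\n";
```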
ID list
The ID list is NOT updated when ID's are modified or
elements cut or deleted.
change_gi
This method will not function properly if you do:

$twig->change_gi( $old1, $new);
$twig->change_gi( $old2, $new);
$twig->change_gi( $new, $even_newer);
sanity check on XML::Parser method calls
XML::Twig should really prevent calls to some
XML::Parser methods, especially the setHandlers
method.
pretty printing
Pretty printing (at least using the 'indented' style)
is hard! You will get a proper pretty printing only if
you output elements that belong to the document.
Printing elements that have been cut makes it
impossible for XML::Twig to figure out their depth, and
thus their indentation level.
Also there is an unavoidable bug when using "flush"
and pretty printing for elements with mixed content
that start with an embedded element:

<elt><b>b</b>toto<b>bold</b></elt>
will be output as
<elt>
<b>b</b>toto<b>bold</b></elt>
if you flush the twig when you find the <b> element.

Globals

These are the things that can mess up calling code,
especially if threaded. They might also cause problems
under mod_perl.

Exported constants
Whether you want them or not you get them! These are
subroutines to use as constants when creating or
testing elements
PCDATA
returns '#PCDATA'
CDATA
returns '#CDATA'
PI
returns '#PI', I had the choice between PROC and PI :--(
Module scoped values: constants
these should cause no trouble:

%base_ent= ( '>' => '&gt;',
'<' => '&lt;',
'&' => '&amp;',
"'" => '&apos;',
'"' => '&quot;',
);
CDATA_START = "<![CDATA[";
CDATA_END = "]]>";
PI_START = "<?";
PI_END = "?>";
COMMENT_START = "<!--";
COMMENT_END = "-->";
pretty print styles

( $NSGMLS, $NICE, $INDENTED, $RECORD1, $RECORD2)= (1..5);
empty tag output style

( $HTML, $EXPAND)= (1..2);
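
To show what a table like %base_ent is for, here is a sketch
of escaping text with it. The hash below is a local copy of
the values listed above, and escape_base_ent is a hypothetical
helper written for this example, not part of the XML::Twig
API.

```perl
use strict;
use warnings;

# Local copy of the base entity table shown above
my %base_ent = ( '>' => '&gt;',
                 '<' => '&lt;',
                 '&' => '&amp;',
                 "'" => '&apos;',
                 '"' => '&quot;',
               );

# Hypothetical helper: replace each special character
# by the matching base entity
sub escape_base_ent {
    my ($text) = @_;
    $text =~ s/([<>&'"])/$base_ent{$1}/g;
    return $text;
}

my $escaped = escape_base_ent(q{a < b && "c"});
print "$escaped\n";    # prints: a &lt; b &amp;&amp; &quot;c&quot;
```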
Module scoped values: might be changed
Most of these deal with pretty printing, so the worst
that can happen is probably that XML output does not
look right, but is still valid and processed
identically by XML processors.
$empty_tag_style can mess up HTML browsers though and
changing $ID would most likely create problems.

$pretty=0;            # pretty print style
$quote='"';           # quote for attributes
$INDENT= ' ';         # indent for indented pretty print
$empty_tag_style= 0;  # how to display empty tags
$ID                   # attribute used as an id ('id' by default)
Module scoped values: definitely changed
These 2 variables are used to replace gi's by an
index, thus saving some space when creating a twig. If
they really cause you too much trouble, let me know,
it is probably possible to create either a switch or
at least a version of XML::Twig that does not perform
this optimisation.

%gi2index; # gi => index
@index2gi; # list of gi's
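
The idea behind these two variables can be sketched as
follows. This is only an illustration of replacing gi's by an
index, not the module's actual code: the variables here are
lexical copies and gi_to_index is a hypothetical helper.

```perl
use strict;
use warnings;

my %gi2index;    # gi    => index
my @index2gi;    # index => gi

# Hypothetical helper: return the index for a gi,
# allocating a new slot the first time the gi is seen
sub gi_to_index {
    my ($gi) = @_;
    unless ( exists $gi2index{$gi} ) {
        push @index2gi, $gi;
        $gi2index{$gi} = $#index2gi;
    }
    return $gi2index{$gi};
}

# Each element would store a small integer instead of a string
my @indexes = map { gi_to_index($_) } qw( doc para para title para );
print "@indexes\n";                                    # prints: 0 1 1 2 1

# The gi is recovered through the shared list
print join( ' ', map { $index2gi[$_] } @indexes ), "\n";
```

With a shared table like this, a gi is stored once however
many elements use it, which is also why state of this kind is
visible to every twig in the program.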

TODO

SAX handlers
Allowing XML::Twig to work on top of any SAX parser,
and to emit SAX events to a handler, is a priority for
version 3.01.
multiple twigs are not well supported
A number of twig features are just global at the
moment. These include the ID list and the "gi pool"
(if you use change_gi then you change the gi for ALL
twigs).
A future version will try to support this while trying
not to be too hard on performance (at least when a
single twig is used!).

BENCHMARKS

You can use the "benchmark_twig" file to do additional
benchmarks. Please send me benchmark information for
additional systems.

AUTHOR

Michel Rodriguez <m.v.rodriguez@ieee.org>

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

Bug reports and comments to m.v.rodriguez@ieee.org

The XML::Twig page is at http://www.xmltwig.com/xmltwig/
It includes examples and a tutorial at
http://www.xmltwig.com/xmltwig/tutorial/index.html

SEE ALSO

XML::Parser