domparse(3)
NAME
XML::Xerces::DOMParse - A Perl module for parsing DOMs.
SYNOPSIS
    # Here's an example that reads in an XML file from the
    # command line and then removes all formatting, re-adds
    # formatting and then prints the DOM to standard output.

    use XML::Xerces;
    use XML::Xerces::DOMParse;

    my $parser = new XML::Xerces::DOMParser ();
    $parser->parse ($ARGV[0]);
    my $doc = $parser->getDocument ();

    XML::Xerces::DOMParse::unformat ($doc);
    XML::Xerces::DOMParse::format ($doc);
    XML::Xerces::DOMParse::print (\*STDOUT, $doc);
DESCRIPTION
Use this module in conjunction with XML::Xerces. Once you
have read an XML file into a DOM tree in memory, this module
provides routines for recursive descent parsing of the
DOM tree. It also provides three concrete and useful
functions to format, unformat and print DOM trees, all
of which are built on the more general parsing functions.
FUNCTIONS
DOMParse::unformat ($node)
Processes $node and its children recursively and removes
all white space text nodes. It is often difficult to process
a DOM tree that contains formatting while preserving
reasonable formatting. Use unformat to remove formatting, then
process the unformatted DOM, then use format to add formatting
back in that is reasonable for the new tree.
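
To make the effect of unformat concrete, here is a minimal sketch on a
toy tree built from plain Perl hashes. It does not use XML::Xerces or
this module: the node layout (type, value, children) and the helper name
strip_whitespace_nodes are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Toy model only: each node is a hash with a 'type' and either a
# 'value' (text nodes) or a 'children' array (element nodes).
# This is NOT the module's implementation, just the idea behind it.
sub strip_whitespace_nodes {
    my ($node) = @_;
    return unless $node->{children};

    # Keep every child that is not a whitespace-only text node.
    my @kept = grep { $_->{type} ne 'text' || $_->{value} =~ /\S/ }
               @{ $node->{children} };
    $node->{children} = \@kept;

    # Recurse into the surviving children.
    strip_whitespace_nodes($_) for @kept;
    return;
}

my $tree = {
    type     => 'element',
    name     => 'list',
    children => [
        { type => 'text', value => "\n\t" },
        { type => 'element', name => 'item',
          children => [ { type => 'text', value => 'one' } ] },
        { type => 'text', value => "\n" },
    ],
};

strip_whitespace_nodes($tree);
my $remaining = scalar @{ $tree->{children} };
```

After the call only the item element is left under list, and its own
text content is untouched because that text is not white space only.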
DOMParse::format ($node)
Processes $node and its children recursively and introduces
white space text nodes to create a DOM tree that
will print with reasonable indents and newlines. Only
call format on a DOM tree that has no formatting white
space in it; otherwise the results will be incorrect.
Call unformat first to remove formatting white space.
You can optionally set the string variable $INDENT to the
indent characters you want to use. By default it is a
single tab.
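
The following sketch shows, on a toy hash-based tree, how indentation
text nodes might be introduced and how an indent string changes the
result. The $INDENT variable here only mirrors the documented setting;
add_indentation, to_string and the node layout are hypothetical and are
not part of the module.

```perl
use strict;
use warnings;

# Mirrors the documented $INDENT setting; in this sketch it is an
# ordinary package variable, not the module's own.
our $INDENT = "\t";

# Toy model only: insert newline-plus-indent text nodes around the
# element children of $node. Elements holding only text are left alone.
# Like format, it assumes the tree has no formatting white space yet.
sub add_indentation {
    my ($node, $depth) = @_;
    my @children = @{ $node->{children} || [] };
    return unless grep { $_->{type} eq 'element' } @children;

    my @formatted;
    for my $child (@children) {
        push @formatted,
            { type => 'text', value => "\n" . ($INDENT x ($depth + 1)) },
            $child;
        add_indentation($child, $depth + 1);
    }
    # The closing tag lines up with the opening tag.
    push @formatted, { type => 'text', value => "\n" . ($INDENT x $depth) };
    $node->{children} = \@formatted;
    return;
}

# Serialize the toy tree so the effect is visible.
sub to_string {
    my ($node) = @_;
    return $node->{value} if $node->{type} eq 'text';
    my $inner = join '', map { to_string($_) } @{ $node->{children} || [] };
    return "<$node->{name}>$inner</$node->{name}>";
}

my $tree = {
    type => 'element', name => 'list',
    children => [
        { type => 'element', name => 'item',
          children => [ { type => 'text', value => 'one' } ] },
        { type => 'element', name => 'item',
          children => [ { type => 'text', value => 'two' } ] },
    ],
};

$INDENT = "  ";              # two spaces instead of the default tab
add_indentation($tree, 0);
my $xml = to_string($tree);
```

With a two-space indent, each item ends up on its own line, indented one
level below list.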
DOMParse::print ($file_handle, $node)
Processes $node and its children recursively and prints
the DOM tree to $file_handle as a standard XML file. You
can override printing behavior by supplying any of several
"printer" functions.

    $NODE_PRINTER
    $DOCUMENT_NODE_PRINTER
    $DOCUMENT_TYPE_NODE_PRINTER
    $COMMENT_NODE_PRINTER
    $TEXT_NODE_PRINTER
    $CDATA_SECTION_NODE_PRINTER
    $ELEMENT_NODE_PRINTER
    $ENTITY_REFERENCE_NODE_PRINTER
    $PROCESSING_INSTRUCTION_NODE_PRINTER
    $ATTRIBUTE_PRINTER

Some of these printers call other printers. For example,
$NODE_PRINTER determines the node type and calls the
corresponding printer for that type, e.g.
$ELEMENT_NODE_PRINTER. So if you replace a printer for a node
which has children, you must take the responsibility for
calling the child node printers.

All printers take two parameters, a file handle and the
node. See DOMParse::parse_nodes and
DOMParse::parse_child_nodes for details.

It is very easy to write a replacement printer that adds
value and then calls the default processing as follows.

    my $original_text_node_printer = $TEXT_NODE_PRINTER;
    $TEXT_NODE_PRINTER = \&my_text_node_printer;

    sub my_text_node_printer {
        my ($fh, $node) = @_;
        # look at the text node and do something extra
        return &$original_text_node_printer ($fh, $node);
    }

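
The rule that a replacement printer for a node with children must call
the child printers itself can be sketched in the same spirit. The
variables below are toy stand-ins that only borrow the documented names,
and they work on hash-based nodes rather than real DOM nodes, so treat
this as an illustration of the dispatch pattern, not the module's code.

```perl
use strict;
use warnings;

# Toy stand-ins for the printer variables; the real ones belong to
# the module and operate on XML::Xerces DOM nodes, not on hashes.
our ($NODE_PRINTER, $TEXT_NODE_PRINTER, $ELEMENT_NODE_PRINTER);

$TEXT_NODE_PRINTER = sub {
    my ($fh, $node) = @_;
    print {$fh} $node->{value};
};

# Dispatcher: picks the printer that matches the node type.
$NODE_PRINTER = sub {
    my ($fh, $node) = @_;
    my $printer = $node->{type} eq 'text'
        ? $TEXT_NODE_PRINTER
        : $ELEMENT_NODE_PRINTER;
    return $printer->($fh, $node);
};

# Replacement element printer: upper-cases tag names and, because an
# element has children, hands each child back to the dispatcher.
$ELEMENT_NODE_PRINTER = sub {
    my ($fh, $node) = @_;
    my $tag = uc $node->{name};
    print {$fh} "<$tag>";
    $NODE_PRINTER->($fh, $_) for @{ $node->{children} };
    print {$fh} "</$tag>";
};

my $tree = {
    type => 'element', name => 'note',
    children => [ { type => 'text', value => 'hi' } ],
};

# Print into an in-memory file handle so the result can be inspected.
my $output = '';
open my $fh, '>', \$output or die "cannot open in-memory handle: $!";
$NODE_PRINTER->($fh, $tree);
close $fh;
```

If the loop over the children were left out of the element printer, the
text inside note would never be printed, which is exactly the
responsibility described above.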
The $ESCAPE variable (integer) controls whether special
XML characters like ampersand "&" are escaped, e.g.
"&amp;". Set $ESCAPE to 1 (default) to escape special
characters, or to 0 to print characters literally.

print_string ($file_handle, $node)
Call print_string whenever you need to expand special
characters (& < > ") to their escape sequence equivalents.
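
As an illustration of the kind of expansion involved, here is a small
stand-alone sketch. The escape_xml helper is hypothetical, and the
$ESCAPE variable in it only mirrors the documented flag; the module's
own print_string may differ in detail.

```perl
use strict;
use warnings;

# Mirrors the documented $ESCAPE flag; here it is a plain package
# variable used only by this sketch.
our $ESCAPE = 1;

# Illustration only: one way to expand & < > " into entity references.
sub escape_xml {
    my ($text) = @_;
    return $text unless $ESCAPE;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    # A single pass avoids re-escaping the '&' of an entity just produced.
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

my $escaped = escape_xml('a < b & "c"');
my $literal = do { local $ESCAPE = 0; escape_xml('a < b & "c"') };
```

With the flag on, the sample string comes back with all four special
characters expanded; with the flag locally set to 0 it is returned
unchanged.
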
The print_string function is used extensively by the default
implementation of DOMParse::print. When you replace various
node printers, you should also be careful to use it to
print node and attribute names and values (but probably
not anything else).

The print function respects the global $ESCAPE flag. By
default it is set to true (1) and escape conversion is
performed. Set it to false (0) when you don't want escape
conversion.

parse_nodes ($node, $process_node, $data)
Call parse_nodes to parse $node and all of its children
recursively. Each node will be visited and your parsing
function, $process_node, will be called. Optional data
$data will be passed through if provided.

Your parsing function must have the following signature.

    process_node ($node, $data)

If it returns 1 then children of $node will also be
parsed. If it returns 0 then they won't. It is common to
use one parsing function to get to a certain level in the
DOM tree, then to return 0 and to call parse_child_nodes
to parse nodes under that level with a different processing
function.

parse_child_nodes ($node, $process_node, $data)
Call parse_child_nodes to parse the children of $node recursively.
This is just like parse_nodes except that $node itself is not parsed.

doc ($node)
Looks up the DOM tree until it finds the document node
associated with the given $node, then returns the document
node.

depth ($node)
Returns the depth of the specified $node in the DOM document.
The document has depth 0, the root node has depth
1, and so on.

element_text ($node)
It is common practice to have an element node that
encloses a single text node. If you know you have such a
node, you can call element_text to directly access the
enclosed text as a string. This is faster than accessing
the enclosed text node and then getting its value.

insert_before ($ref_node, $new_node)
Inserts $new_node in the DOM tree immediately before and
as a sibling of $ref_node. It is safe to call
insert_before while in the middle of parsing a DOM tree if
$ref_node is the current node being parsed. The newly
inserted node will not be parsed.

insert_after ($ref_node, $new_node)
Inserts $new_node in the DOM tree immediately after and as
a sibling of $ref_node. It is safe to call insert_after
while in the middle of parsing a DOM tree if $ref_node is
the current node being parsed. The newly inserted node
will not be parsed.

remove ($node)
Removes $node from the DOM tree. It is safe to call
remove while in the middle of parsing a DOM tree if $node
is the current node being parsed. The next node to be
parsed will be the same node that would have been parsed had
$node not been removed, e.g. $node's next sibling.
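
The interplay between a parsing callback's return value and removing the
current node can be sketched on a toy tree. Everything below (walk,
remove_node, element and the hash-based nodes) is hypothetical and only
models the behavior described in this section; it is not the module's
implementation.

```perl
use strict;
use warnings;

# Toy model only: call $callback on $node and visit the children when
# it returns true. Children are taken from a snapshot, so a callback
# may remove the current node without disturbing the walk.
sub walk {
    my ($node, $callback, $data) = @_;
    return unless $callback->($node, $data);
    my @snapshot = @{ $node->{children} || [] };
    walk($_, $callback, $data) for @snapshot;
    return;
}

# Detach $node from its parent's child list.
sub remove_node {
    my ($node) = @_;
    my $parent = $node->{parent} or return;
    $parent->{children} = [ grep { $_ != $node } @{ $parent->{children} } ];
    delete $node->{parent};
    return;
}

# Build an element and record the parent link on each child.
sub element {
    my ($name, @children) = @_;
    my $node = { name => $name, children => [@children] };
    $_->{parent} = $node for @children;
    return $node;
}

my $root = element('root',
    element('a'),
    element('skip', element('inner')),
    element('b'),
);

my @visited;
walk($root, sub {
    my ($node, $seen) = @_;
    push @$seen, $node->{name};
    if ($node->{name} eq 'skip') {
        remove_node($node);
        return 0;    # do not descend into the removed subtree
    }
    return 1;        # descend into the children
}, \@visited);
```

In this run the nodes are visited in the order root, a, skip, b: the
removed node's next sibling is still reached, and inner is never visited
because the callback returned 0 for its parent.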
AUTHORS
Tom Watson <rtwatson@us.ibm.com> wrote version 1.0 and
submitted it to the XML Apache project
<http://xml.apache.org>, where you can contribute to
future versions and where the corresponding C++ and Java
parsers are also developed as open source projects.
Jason Stewart <jason@openinformatics.com> adapted it to
the Xerces-1.3 API.
BUGS
Any comments or questions about this module can be
addressed to the Xerces.pm development list
<xerces-p-dev@xml.apache.org>.