libxml(3)
NAME
XML::LibXML - Interface to the gnome libxml2 library
SYNOPSIS
    use XML::LibXML;
    my $parser = XML::LibXML->new();
    my $doc = $parser->parse_string(<<'EOT');
    <some-xml/>
    EOT
DESCRIPTION
This module is an interface to the gnome libxml2 DOM
parser (no SAX parser support yet), and the DOM tree. It
also provides an XML::XPath-like findnodes() interface,
providing access to the XPath API in libxml2.
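As a minimal sketch of the findnodes() interface mentioned above: the sample document and its element names are invented for illustration, and the sketch assumes findnodes() returns the matching nodes in list context.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, invented for this sketch.
my $xml = <<'EOT';
<library>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</library>
EOT

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_string($xml);

# Evaluate an XPath expression against the document and collect
# the "id" attribute of every matching <book> element.
my @ids = map { $_->getAttribute('id') } $doc->findnodes('/library/book');

print "$_\n" for @ids;
```

The XPath expression is passed as a plain string; each returned node can then be queried with the usual DOM methods.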
OPTIONS
LibXML options are global (unfortunately this is a limitation
of the underlying implementation, not of this interface). They
can be set using either "$parser->option(...)" or
"XML::LibXML->option(...)"; both are treated in the same
manner. Note that even two forked processes will share some of
the same options, so be careful out there!
Every option returns the previous value, and can be called
without parameters to get the current value.
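A short sketch of the get/set pattern, using keep_blanks as the example option. It only relies on the getter form reflecting the last value set; it deliberately does not depend on what the setter call itself returns, since that may differ between releases.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Called without arguments, the accessor reports the current value.
my $before = $parser->keep_blanks();

# Called with an argument, it changes the option.
$parser->keep_blanks(0);
my $after = $parser->keep_blanks();

printf "keep_blanks: before=%d after=%d\n", $before, $after;
```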
- validation
- $parser->validation(1);
- Turn validation on (or off). Defaults to off.
- recover
    $parser->recover(1);
- Turn the parser's recover mode on (or off). Defaults to off.
- This allows broken XML data to be parsed into memory. This
switch only works with XML data, not with HTML data. Validation
is switched off automatically as well.
- Recover mode is very efficient at recovering documents that
are almost well-formed, for example a document that forgets to
close the document tag (or any other tag inside the document).
It has trouble, though, restoring documents that are more like
well-balanced chunks. In that case XML::LibXML will only parse
the first tag of the chunk.
- expand_entities
    $parser->expand_entities(0);
- Turn entity expansion on or off; it is enabled by default. If
entity expansion is off, any external parsed entities in the
document are left as entities. Probably not very useful for
most purposes.
- keep_blanks
    $parser->keep_blanks(0);
- Allows you to turn off XML::LibXML's default behaviour of
maintaining whitespace in the document.
- pedantic_parser
    $parser->pedantic_parser(1);
- You can make XML::LibXML more pedantic if you want to.
- load_ext_dtd
    $parser->load_ext_dtd(1);
- Load external DTD subsets while parsing.
- complete_attributes
    $parser->complete_attributes(1);
- Complete the elements' attribute lists with the ones
defaulted from the DTDs. By default, this option is enabled.
- expand_xinclude
    $parser->expand_xinclude(1);
- Expands XInclude tags immediately while parsing the document.
This flag ensures that the parser callbacks are used while
parsing the included document.
- load_catalog
    $parser->load_catalog( $catalog_file );
- Will use $catalog_file as a catalog during all parsing
processes. Using a catalog can significantly speed up parsing
if many external resources (such as DTDs or XIncludes) are
loaded into the parsed documents.
- Note that catalogs will not be available if an external
entity handler was specified. At present it is not possible to
make use of both types of resolving systems at the same time.
- base_uri
    $parser->base_uri( $your_base_uri );
- When parsing strings or file handles, XML::LibXML doesn't
know the base URI of the document. To make relative references
such as XIncludes work, one has to set a separate base URI,
which is then used for the parsed documents.
- gdome_dom
    $parser->gdome_dom(1);
- Although quite powerful, XML::LibXML's DOM implementation is
limited if one needs or wants full DOM level 2 or level 3
support. XML::GDOME is based on libxml2 as well, but provides a
rather complete DOM implementation by wrapping libgdome. This
switch allows you to make use of XML::LibXML's full parser
options and XML::GDOME's DOM implementation at the same time.
- All XML::LibXML parser functions recognize this switch.
- match_callback
    $parser->match_callback($subref);
- Sets a "match" callback. See "Input Callbacks" below.
- open_callback
    $parser->open_callback($subref);
- Sets an open callback. See "Input Callbacks" below.
- read_callback
    $parser->read_callback($subref);
- Sets a read callback. See "Input Callbacks" below.
- close_callback
    $parser->close_callback($subref);
- Sets a close callback. See "Input Callbacks" below.
CONSTRUCTOR
The XML::LibXML constructor, "new()", takes the following
parameters:
- ext_ent_handler
    my $parser = XML::LibXML->new(ext_ent_handler => sub { ... });
- The ext_ent_handler sub is called whenever libxml needs to
load an external parsed entity. The handler sub will be passed
two parameters: a URL (SYSTEM identifier) and an ID (PUBLIC
identifier). It should return a string containing the resource
at the given URI.
- Note that you do not need to enable this: if it is not
supplied, libxml will get the resource either directly from the
filesystem or using an internal HTTP client library.
- catalog
    my $parser = XML::LibXML->new( catalog => $private_catalog );
- As an alternative to ext_ent_handler, the catalog parameter
allows libxml2's catalog interface to be used directly. The
parameter takes the filename of a catalog file. This catalog is
loaded by libxml2 and will be used during parsing processes.
- Note that catalogs will not be available if an external
entity handler was specified. At present it is not possible to
make use of both types of resolving systems at the same time.
DEFAULT VALUES
The following table gives an overview of the default values of
the parser attributes.
validation          == off (0)
recover             == off (0)
expand_entities     == on (1)
keep_blanks         == on (1)
pedantic_parser     == off (0)
load_ext_dtd        == on (1)
complete_attributes == on (1)
expand_xinclude     == off (0)
base_uri            == ""
gdome_dom           == off (0)
By default no callback handler is set.
PARSING
There are three ways to parse documents: as a string, as a
Perl filehandle, or as a filename. The return value from each
is an XML::LibXML::Document object, which is a DOM object
(although not all DOM methods are implemented yet). See
"XML::LibXML::Document" below for more details on the methods
available on documents.
Each of the methods below will throw an exception if the
document is invalid. To prevent this from causing your program
to exit, wrap the call in an eval{} block.
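The eval{} pattern can be sketched as follows. The broken input string is invented for the example, and the sketch assumes recover mode is left at its default (off), so the mismatched tag raises an exception.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Deliberately broken input: <bar> is never closed.
my $bad_xml = '<foo><bar></foo>';

# eval{} traps the exception, so the program keeps running.
my $doc   = eval { $parser->parse_string($bad_xml) };
my $error = $@;

if ($error) {
    # $error holds the parser's error message.
    print "parse failed: $error\n";
}
```

Copying $@ into a lexical straight after the eval avoids losing the message if a later statement resets $@.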
- parse_string
    my $doc = $parser->parse_string($string);
- or, passing in a directory to use as the "base":
    my $doc = $parser->parse_string($string, $dir);
- parse_fh
    my $doc = $parser->parse_fh($fh);
- Here, $fh can be an IOREF, or a subclass of IO::Handle.
- And again, you can pass in a directory as the "base":
    my $doc = $parser->parse_fh($fh, $dir);
- Note that in the above two cases, $dir must end in a trailing
slash, otherwise the parent of that directory is used. This can
actually be useful, in that it will accept the filename of what
you're parsing.
- parse_file
    my $doc = $parser->parse_file($filename);
- This function reads the file with the given absolute filename
into memory. It causes XML::LibXML to use libxml2's file parser
instead of letting Perl read the file, as parse_fh() does. If
you need to parse files directly, this function is the faster
choice: it is about 6-8 times faster than parse_fh().
- Parsing Html
- As of version 0.96, XML::LibXML is capable of parsing HTML
into a regular XML DOM. This gives you the full power of
XML::LibXML on HTML documents.
- The methods work in exactly the same way as the methods
above, and return exactly the same type of object. If you wish
to dump the resulting document as HTML again, you can use
"$doc->toStringHTML()" to do that.
- parse_html_string
    my $doc = $parser->parse_html_string($string);
- parse_html_fh
    my $doc = $parser->parse_html_fh($fh);
- parse_html_file
    my $doc = $parser->parse_html_file($filename);
- Push Parser
- XML::LibXML also supports a push parser interface. This
allows one to parse large documents without actually loading
the entire document into memory.
- The interface is divided into two parts:
- · pushing the data into the parser
  · finishing the parse
- The user has no chance to access the document while still
pushing data to the parser. The resulting document is returned
when the parser is told to finish the parsing process.
- $parser->push( @data )
- This function pushes the data stored inside the array to
libxml2's parser. Each entry in @data must be a normal scalar!
- $parser->finish_push( $restore );
- This function returns the result of the parsing process. If
it is called without a parameter, it will complain about
documents that are not well-formed. If $restore is 1, the push
parser can be used to restore broken or not well-formed (XML)
documents, as the following example shows:
    $parser->push( "<foo>", "bar" );
    eval { $doc = $parser->finish_push(); };    # will complain
    if ( $@ ) {
        # ...
    }
- This can be annoying if the closing tag is missing by
accident. The following code will restore the document:
    eval { $doc = $parser->finish_push(1); };   # will not complain
    warn $doc->toString();    # returns "<foo>bar</foo>"
- Of course finish_push() will return nothing if no data was
pushed to the parser before.
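For comparison, a minimal sketch of the push interface with well-formed input that arrives in pieces. The chunk contents are invented for illustration and stand in for data read piecemeal from, say, a socket.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Hypothetical chunks; together they form one well-formed document.
my @chunks = ('<log><entry>', 'first', '</entry>', '</log>');

# Feed the parser one plain scalar at a time.
$parser->push($_) for @chunks;

# The document only becomes available once the parse is finished.
my $doc = eval { $parser->finish_push() };
die "push parse failed: $@" if $@;

print $doc->toString();
```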
- Extra parsing methods
- processXIncludes
    $parser->processXIncludes( $doc );
- While the document class implements its own XInclude
processing, this method is strictly related to the parser. Its
use is only required if the parser implements special callbacks
that should be used for the XIncludes as well.
- If expand_xinclude is set to 1, the method is only required
to process XIncludes appended to the DOM after its original
parsing.
- Error Handling
- XML::LibXML throws exceptions during parsing, validation or
XPath processing. These errors can be caught by using eval
blocks. The error is then stored in $@. Alternatively one can
use the get_last_error() function of XML::LibXML. It returns
the same string that is stored in $@. Using get_last_error()
still makes it necessary to eval the statement, since these
function groups will die() on errors.
- get_last_error() can be called either on the class itself or
on a parser instance:
    $errstring = XML::LibXML->get_last_error();
    $errstring = $parser->get_last_error();
- Note that XML::LibXML exceptions are global. That means that
if get_last_error() is called on a parser instance, the last
global error will be returned. This is not necessarily the
error caused by that parser instance itself.
- Serialization
- The opposite of parsing is serialization. In XML::LibXML this
can be done using the functions toString(), toFile() and
toFH(). All serialization functions understand the flag
setTagCompression. If this flag is set to 1, empty tags are
displayed as <foo></foo> rather than <foo/>.
- toString() additionally checks two other flags:
- skipDTD and skipXMLDeclaration
- If skipDTD is specified and any DTD node is found in the
document, it will not be serialized.
- If skipXMLDeclaration is set to 1, the document's XML
declaration is not serialized. This flag will cause the
document to be serialized as UTF-8 even if the document has
another encoding specified.
- XML::LibXML does not define these flags itself, therefore
they have to be specified manually by the caller:
    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::skipDTD = 1;
    local $XML::LibXML::setTagCompression = 1;
- will cause the serializer to omit the XML declaration for a
document, skip the DTD if found, and expand empty tags.
- *NOTE* $XML::LibXML::skipXMLDeclaration and
$XML::LibXML::skipDTD are only recognized by the Document's
toString() function.
- Additionally it is possible to serialize single nodes by
calling toString() on the node. Since a node has no DTD and no
XML declaration, the related flags have no effect. Nevertheless
setTagCompression is supported.
- All basic serialization functions recognize an additional
formatting flag. This flag is an easy way to format complex XML
documents without adding ignorable whitespace.
- Input Callbacks
- The input callbacks are used whenever LibXML has to get
something other than external parsed entities from somewhere.
The input callbacks in LibXML are stacked on top of the
original input callbacks within the libxml library. This means
that if you decide not to use your own callbacks (see
"match()"), you can fall back to the default way of handling
input. This makes it possible, for example, to handle only
certain URI schemes.
- Callbacks are only used on files, not on strings or
filehandles. This is because LibXML requires the match event to
find out which callback set shall be used for the current input
stream. LibXML can decide this only before the stream is
opened. For LibXML, strings and filehandles are already opened
streams.
- The following callbacks are defined:
- match(uri)
- If you want to handle the URI, simply return a true value
from this callback.
- open(uri)
- Open something and return it to handle that resource.
- read(handle, bytes)
- Read a certain number of bytes from the resource. This
callback is called even if the entire document has already been
read.
- close(handle)
- Close the handle associated with the resource.
- Example
- This is a purely fictitious example that uses a
MyScheme::Handler object that responds to methods similar to an
IO::Handle.
    $parser->match_callback(\&match_uri);
    $parser->open_callback(\&open_uri);
    $parser->read_callback(\&read_uri);
    $parser->close_callback(\&close_uri);

    # Claim only URIs that use the fictitious myscheme: scheme.
    sub match_uri {
        my $uri = shift;
        return $uri =~ /^myscheme:/;
    }

    # Return the object that the read and close callbacks receive.
    sub open_uri {
        my $uri = shift;
        return MyScheme::Handler->new($uri);
    }

    # Hand back up to $length bytes from the resource.
    sub read_uri {
        my $handler = shift;
        my $length = shift;
        my $buffer;
        $handler->read($buffer, $length);
        return $buffer;
    }

    sub close_uri {
        my $handler = shift;
        $handler->close();
    }
- A more realistic example can be found in the "example"
directory.
- Since the parser requires all callbacks to be defined, it is
also possible to set all callbacks with a single call of
callbacks(). This would simplify the example code to:
    $parser->callbacks( \&match_uri, \&open_uri, \&read_uri,
                        \&close_uri );
- All functions that are used to set the callbacks can also be
used to retrieve the callbacks from the parser.
- Global Callbacks
- Optionally it is possible to apply global callbacks at the
XML::LibXML class level. This allows multiple parsers to share
the same callbacks. To set these global callbacks, one can use
the callback access functions directly on the class.
    XML::LibXML->callbacks( \&match_uri, \&open_uri, \&read_uri,
                            \&close_uri );
- The previous code snippet will set the callbacks from the
first example as global callbacks.
- Encoding
- All data will be stored UTF-8 encoded. Nevertheless the input
and output functions are aware of the encoding of the owner
document. By default all functions will assume UTF-8 encoding
of the passed strings unless the owner document has a different
encoding. In such a case the functions will assume the encoding
of the document to be valid.
- At the current state of implementation, query functions like
findnodes(), getElementsByTagName() or getAttribute() accept
only UTF-8 encoded strings, even if the underlying document has
a different encoding. At first this seems to be a limitation,
but at the application level there is no way to make safe
assumptions about the encoding of the strings.
- Future releases will offer the opportunity to force an
application-wide encoding, so make sure that you have installed
the latest version of XML::LibXML.
- To encode or decode a string to or from UTF-8, XML::LibXML
exports two functions, which use the encoding mechanism of the
underlying implementation. These functions should be used if
external encoding is required (e.g. for query functions).
- encodeToUTF8
    $encodedstring = encodeToUTF8( $name_of_encoding,
                                   $string_to_encode );
- The function will encode a string from the specified encoding
to UTF-8.
- decodeFromUTF8
    $decodedstring = decodeFromUTF8( $name_of_encoding,
                                     $string_to_decode );
- This function transforms a UTF-8 encoded string to the
specified encoding. While transformations to ISO encodings may
cause errors if the given string contains unsupported
characters, this function can transform to UTF-16 encodings as
well.
- XML::LibXML and XML::GDOME
- THE FUNCTIONS DESCRIBED HERE ARE STILL EXPERIMENTAL
- Although both modules make use of libxml2's XML capabilities,
the DOM implementations of the two modules are not compatible.
It is still possible, however, to exchange nodes from one DOM
to the other. The concept of this exchange is pretty similar to
the function cloneNode(): the particular node is copied at a
low level to the opposite DOM implementation.
- Since the DOM implementations cannot coexist within one
document, one is forced to copy each node that should be used.
Because two nodes are always kept, this may have quite an
impact on a machine's memory usage.
- XML::LibXML provides two functions to export or import GDOME
nodes: import_GDOME() and export_GDOME(). Both functions take
two parameters: the node and a flag for recursive import. The
flag works as in cloneNode().
- import_GDOME
    XML::LibXML->import_GDOME( $node, $deep );
- This converts an XML::GDOME node to XML::LibXML explicitly.
- export_GDOME
    XML::LibXML->export_GDOME( $node, $deep );
- Allows an XML::LibXML node to be exported to XML::GDOME
explicitly.
- Although these two explicit functions exist, XML::LibXML also
allows the transparent import of XML::GDOME nodes in functions
such as appendChild(), insertAfter() and so on. While native
nodes are automatically adopted in most functions, XML::GDOME
nodes are always cloned in advance. Thus if the original node
is modified after the operation, the node in the XML::LibXML
document will not reflect the change.
XML::LibXML::Dtd
This module allows you to parse and return a DTD object.
It has one method right now, "new()".
- new()
    my $dtd = XML::LibXML::Dtd->new($public, $system);
- Creates a new DTD object from the public and system
identifiers. It will automatically load the objects from the
filesystem, or use the input callbacks (see "Input Callbacks"
above) to load the DTD.
Processing Instructions - XML::LibXML::PI
Processing instructions are implemented in XML::LibXML with
read and write access ;) The PI data is the PI without the PI
target (as specified in XML 1.0 [17]), as a string. This string
can be accessed with getData, as implemented in
XML::LibXML::Node.
- The write access takes into account the fact that many
processing instructions have attribute-like data. Therefore
setData provides, besides the interface that conforms to the
DOM spec, the option to pass a set of named parameters. So the
code segment
    my $pi = $dom->createProcessingInstruction("abc");
    $pi->setData(foo=>'bar', foobar=>'foobar');
    $dom->appendChild( $pi );
- will result in the following PI in the DOM:
    <?abc foo="bar" foobar="foobar"?>
- The same can be done with
    $pi->setData( 'foo="bar" foobar="foobar"' );
- which is how it is specified in the "DOM specification". This
three-step interface temporarily creates a node in Perl space.
This can be avoided by using the insertProcessingInstruction
method. Instead of the three calls described above, the call
    $dom->insertProcessingInstruction("abc", 'foo="bar" foobar="foobar"');
will have the same result as above.
- Currently only the setData() function accepts named
parameters, while only strings are accepted by the other
methods.
- createProcessingInstruction
- SYNOPSIS:
    $pinode = $dom->createProcessingInstruction( $target );
- or
    $pinode = $dom->createProcessingInstruction( $target,
                                                 $data );
- This function creates a new PI and returns this node. The PI
is bound to the DOM, but is not appended to the DOM itself. To
add the PI to the DOM, one needs to use appendChild() directly
on the DOM itself.
- insertProcessingInstruction
- SYNOPSIS:
    $dom->insertProcessingInstruction( $target, $data );
- Creates a processing instruction and inserts it directly into
the DOM. The function does not return a node.
- createPI
- alias for createProcessingInstruction
- insertPI
- alias for insertProcessingInstruction
- setData
- SYNOPSIS:
    $pinode->setData( $data_string );
- or
    $pinode->setData( name=>string_value [...] );
- This method allows the content data of a PI to be changed. In
addition to the interface specified for DOM Level 2, the method
provides a named-parameter interface to set the data. The
parameter list is converted into a string before it is appended
to the PI.
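A small sketch contrasting the two setData() forms. The target name "abc" and the data values are taken from the example above; the host document `<root/>` is invented for illustration. The named-parameter call uses a single pair, on the assumption that the order of several named pairs is not guaranteed.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
my $dom    = $parser->parse_string('<root/>');

# Create a PI bound to the document; it is not inserted into the tree.
my $pi = $dom->createProcessingInstruction('abc');

# Named-parameter form: the pair is turned into attribute-like data.
$pi->setData(foo => 'bar');
print $pi->toString(), "\n";

# String form, as in the DOM specification: the data is set verbatim.
$pi->setData('foo="bar" foobar="foobar"');
print $pi->toString(), "\n";
```

When the order of the pseudo-attributes matters, the string form is the safer choice, since the caller controls it directly.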
AUTHOR
Matt Sergeant, matt@sergeant.org
Copyright 2001, AxKit.com Ltd. All rights reserved.
SEE ALSO
- XML::LibXSLT, XML::LibXML::DOM, XML::LibXML::SAX