HTML::Element(3)

NAME

HTML::Element - Class for objects that represent HTML elements

SYNOPSIS

use HTML::Element;
$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$a->push_content("The Perl Homepage");
$tag = $a->tag;
print "$tag starts out as:",  $a->starttag, "\n";
print "$tag ends as:",  $a->endtag, "\n";
print "$tag\'s href attribute is: ", $a->attr('href'), "\n";
$links_r = $a->extract_links();
print "Hey, I found ", scalar(@$links_r), " links.\n";
print "And that, as HTML, is: ", $a->as_HTML, "\n";
$a = $a->delete;

DESCRIPTION

(This class is part of the HTML::Tree dist.)

Objects of the HTML::Element class can be used to represent
elements of HTML document trees. These objects have
attributes, notably attributes that designate each
element's parent and content. The content is an array of
text segments and other HTML::Element objects. A tree
with HTML::Element objects as nodes can represent the
syntax tree for an HTML document.

HOW WE REPRESENT TREES

It may occur to you to wonder what exactly a "tree" is,
and how it's represented in memory. Consider this HTML
document:

<html lang='en-US'>
  <head>
    <title>Stuff</title>
    <meta name='author' content='Jojo'>
  </head>
  <body>
    <h1>I like potatoes!</h1>
  </body>
</html>

Building a syntax tree out of it makes a tree-structure in
memory that could be diagrammed as:

                 html (lang='en-US')
                  / \
                /     \
              /         \
            head        body
           /\               \
         /    \               \
       /        \               \
     title     meta              h1
      |       (name='author',     |
   "Stuff"    content='Jojo')    "I like potatoes"

This is the traditional way to diagram a tree, with the
"root" at the top, and it's this kind of diagram that
people have in mind when they say, for example, that "the
meta element is under the head element instead of under
the body element". (The same is also said with "inside"
instead of "under" -- the use of "inside" makes more sense
when you're looking at the HTML source.)

Another way to represent the above tree is with indenting:

html (attributes: lang='en-US')
  head
    title
      "Stuff"
    meta (attributes: name='author' content='Jojo')
  body
    h1
      "I like potatoes"

Incidentally, diagramming with indenting works much better
for very large trees, and is easier for a program to
generate. The $tree->dump method uses indentation just that
way.

However you diagram the tree, it's stored the same in
memory -- it's a network of objects, each of which has
attributes like so:

element #1:  _tag: 'html'
             _parent: none
             _content: [element #2, element #5]
             lang: 'en-US'

element #2:  _tag: 'head'
             _parent: element #1
             _content: [element #3, element #4]

element #3:  _tag: 'title'
             _parent: element #2
             _content: [text segment "Stuff"]

element #4:  _tag: 'meta'
             _parent: element #2
             _content: none
             name: 'author'
             content: 'Jojo'

element #5:  _tag: 'body'
             _parent: element #1
             _content: [element #6]

element #6:  _tag: 'h1'
             _parent: element #5
             _content: [text segment "I like potatoes"]

The "treeness" of the tree-structure that these elements
comprise is not an aspect of any particular object, but is
emergent from the relatedness attributes (_parent and
_content) of these element-objects and from how you use
them to get from element to element.

While you could access the content of a tree by writing
code that says "access the 'src' attribute of the root's
first child's seventh child's third child", you're more
likely to have to scan the contents of a tree, looking for
whatever nodes, or kinds of nodes, you want to do something
with. The most straightforward way to look over a
tree is to "traverse" it; an HTML::Element method
($h->traverse) is provided for this purpose, and several
other HTML::Element methods are based on it.

(For everything you ever wanted to know about trees, and
then some, see Niklaus Wirth's Algorithms + Data
Structures = Programs or Donald Knuth's The Art of
Computer Programming, Volume 1.)
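
As a rough illustration of the representation described above, here is a minimal sketch that builds the same tree by hand and then follows the content links downward and the parent links upward. It assumes the new_from_lol constructor (mentioned under push_content) accepts nested arrayrefs with an optional hashref of attributes; the variable names are made up for this example.

```perl
use strict;
use warnings;
use HTML::Element;

# Build the example document as a list-of-lists (assumed new_from_lol form:
# [tag, {attributes}, children...]).
my $root = HTML::Element->new_from_lol(
    ['html', { lang => 'en-US' },
        ['head',
            ['title', 'Stuff'],
            ['meta', { name => 'author', content => 'Jojo' }],
        ],
        ['body',
            ['h1', 'I like potatoes!'],
        ],
    ]
);

# Follow the content links downward: root -> body -> h1.
my ($head, $body) = $root->content_list;
my ($h1)          = $body->content_list;

# Follow the parent links upward from h1 back to the root.
my @path = ($h1->tag);
for (my $node = $h1->parent; defined $node; $node = $node->parent) {
    unshift @path, $node->tag;
}
print join(' > ', @path), "\n";    # html > body > h1

$root->dump;      # prints the indented diagram to STDOUT
$root->delete;    # free the tree when done with it
```

Nothing in this sketch stores "the tree" as such; it only ever holds individual elements and hops between them, which is the point made above about treeness being emergent.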

BASIC METHODS

$h = HTML::Element->new('tag', 'attrname' => 'value', ... )
This constructor method returns a new HTML::Element
object. The tag name is a required argument; it will
be forced to lowercase. Optionally, you can specify
other initial attributes at object creation time.
$h->attr('attr') or $h->attr('attr', 'value')
Returns (optionally sets) the value of the given
attribute of $h. The attribute name (but not the
value, if provided) is forced to lowercase. If trying
to read the value of an attribute not present for this
element, the return value is undef. If setting a new
value, the old value of that attribute is returned.
If methods are provided for accessing an attribute
(like $h->tag for "_tag", $h->content_list, etc.
below), use those instead of calling $h->attr,
whether for reading or setting.
Note that setting an attribute to undef (as opposed to
"", the empty string) actually deletes the attribute.
$h->tag() or $h->tag('tagname')
Returns (optionally sets) the tag name (also known as
the generic identifier) for the element $h. In setting,
the tag name is always converted to lower case.
There are four kinds of "pseudo-elements" that show up
as HTML::Element objects:
Comment pseudo-elements
These are element objects with a "$h->tag" value
of "~comment", and the content of the comment is
stored in the "text" attribute
($h->attr("text")). For example, parsing this
code with HTML::TreeBuilder...

<!-- I like Pie.
Pie is good
-->

produces an HTML::Element object with these
attributes:

"_tag", "~comment", "text", " I like Pie.\nPie is good\n"
Declaration pseudo-elements
Declarations (rarely encountered) are represented
as HTML::Element objects with a tag name of
"~declaration", and content in the "text" attribute.
For example, this:

<!DOCTYPE foo>
produces an element whose attributes include:

"_tag", "~declaration", "text", "DOCTYPE foo"
Processing instruction pseudo-elements
PIs (rarely encountered) are represented as
HTML::Element objects with a tag name of "~pi",
and content in the "text" attribute. For example,
this:

<?stuff foo?>
produces an element whose attributes include:

"_tag", "~pi", "text", "stuff foo?"
(assuming a recent version of HTML::Parser)
~literal pseudo-elements
These objects are not currently produced by
HTML::TreeBuilder, but can be used to represent a
"super-literal" -- i.e., a literal you want to be
immune from escaping. (Yes, I just made that term
up.)
That is, this is useful if you want to insert code
into a tree that you plan to dump out with
"as_HTML", where you want, for some reason, to
suppress "as_HTML"'s normal behavior of amp-quoting
text segments.
For example, this:

my $literal = HTML::Element->new('~literal',
'text' => 'x < 4 & y > 7'
);
my $span = HTML::Element->new('span');
$span->push_content($literal);
print $span->as_HTML;
prints this:

<span>x < 4 & y > 7</span>
Whereas this:

my $span = HTML::Element->new('span');
$span->push_content('x < 4 & y > 7');
# normal text segment
print $span->as_HTML;
prints this:

<span>x &lt; 4 &amp; y &gt; 7</span>
Unless you're inserting lots of pre-cooked code
into existing trees, and dumping them out again,
it's not likely that you'll find "~literal"
pseudo-elements useful.
$h->parent() or $h->parent($new_parent)
Returns (optionally sets) the parent (aka "container")
for this element. The parent should either be undef,
or should be another element.
You should not use this to directly set the parent of an element. Instead use any of the other methods
under "Structure-Modifying Methods", below.
Note that not($h->parent) is a simple test for whether
$h is the root of its subtree.
$h->content_list()
Returns a list of the child nodes of this element --
i.e., what nodes (elements or text segments) are
inside/under this element. (Note that this may be an
empty list.)
In a scalar context, this returns the count of the
items, as you may expect.
$h->content()
This somewhat deprecated method returns the content of
this element; but unlike content_list, this returns
either undef (which you should understand to mean no
content), or a reference to the array of content items, each of which is either a text segment (a
string, i.e., a defined non-reference scalar value),
or an HTML::Element object. Note that even if an
arrayref is returned, it may be a reference to an
empty array.
While older code should feel free to continue to use
$h->content, new code should use $h->content_list in
almost all conceivable cases. It is my experience
that in most cases this leads to simpler code anyway,
since it means one can say:

@children = $h->content_list;
instead of the inelegant:

@children = @{$h->content || []};
If you do use $h->content (or $h->content_array_ref),
you should not use the reference returned by it
(assuming it returned a reference, and not undef) to
directly set or change the content of an element or
text segment! Instead use "content_refs_list" or any
of the other methods under "Structure-Modifying
Methods", below.
$h->content_array_ref()
This is like "content" (with all its caveats and
deprecations) except that it is guaranteed to return an
array reference. That is, if the given node has no
"_content" attribute, the "content" method would
return undef, but "content_array_ref" would set
the given node's "_content" value to "[]" (a reference
to a new, empty array), and return that.
$h->content_refs_list
This returns a list of scalar references to each element
of $h's content list. This is useful in case you want to
edit any large text segments in place, without having to
get a copy of the current value of that segment, modify
that copy, then use "splice_content" to replace the old
with the new. Instead, here you can edit in place:

foreach my $item_r ($h->content_refs_list) {
    next if ref $$item_r;
    $$item_r =~ s/honour/honor/g;
}

You could currently achieve the same effect with:

foreach my $item (@{ $h->content_array_ref }) {
    # deprecated!
    next if ref $item;
    $item =~ s/honour/honor/g;
}

...except that using the return value of $h->content
or $h->content_array_ref to do that is deprecated, and
just might stop working in the future.
$h->implicit() or $h->implicit($bool)
Returns (optionally sets) the "_implicit" attribute.
This attribute is a flag that's used for indicating
that the element was not originally present in the
source, but was added to the parse tree (by
HTML::TreeBuilder, for example) in order to conform to
the rules of HTML structure.
$h->pos() or $h->pos($element)
Returns (and optionally sets) the "_pos" (for "current
position") pointer of $h. This attribute is a pointer
used during some parsing operations, whose value is
whatever HTML::Element element at or under $h is
currently "open", where $h->insert_element(NEW) will
actually insert a new element.
(This has nothing to do with the Perl function called
"pos", for controlling where regular expression
matching starts.)
If you set $h->pos($element), be sure that $element is
either $h, or an element under $h.
If you've been modifying the tree under $h and are no
longer sure $h->pos is valid, you can enforce validity
with:

$h->pos(undef) unless $h->pos->is_inside($h);
$h->all_attr()
Returns all this element's attributes and values, as
key-value pairs. This will include any "internal"
attributes (i.e., ones not present in the original
element, and which will not be represented if/when you
call $h->as_HTML). Internal attributes are distinguished
by the fact that the first character of their
key (not value! key!) is an underscore ("_").
Example output of $h->all_attr(): '_parent',
[object_value], '_tag', 'em', 'lang', 'en-US',
'_content', [array-ref value].
$h->all_attr_names()
Like all_attr, but only returns the names of the
attributes.
Example output of $h->all_attr_names(): '_parent',
'_tag', 'lang', '_content'.
$h->all_external_attr()
Like "all_attr", except that internal attributes are
not present.
$h->all_external_attr_names()
Like "all_attr_names", except that internal
attributes' names are not present.
$h->id() or $h->id($string)
Returns (optionally sets to $string) the "id"
attribute. "$h->id(undef)" deletes the "id"
attribute.
$h->idf() or $h->idf($string)
Just like the "id" method, except that if you call
"$h->idf()" and no "id" attribute is defined for this
element, then it's set to a likely-to-be-unique value,
and returned. (The "f" is for "force".)
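
To tie several of the basic accessors together, here is a small sketch. It only uses methods described in this section plus push_content (described below); the element, its attribute values and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

# The tag name is given in uppercase on purpose; new() lowercases it.
my $p = HTML::Element->new('P', class => 'note', id => 'intro');
$p->push_content('Hello, ', ['em', 'world'], '!');
my $tag = $p->tag;
print "$tag\n";                      # p

# Setting an attribute returns its old value; attribute names are lowercased.
my $old = $p->attr('class', 'warning');
my $new = $p->attr('CLASS');
print "$old -> $new\n";              # note -> warning

# Setting an attribute to undef deletes it.
$p->attr('class', undef);
my $gone = !defined $p->attr('class');

# content_list gives the children, or their count in scalar context.
my @children = $p->content_list;
my $count    = $p->content_list;
for my $child (@children) {
    # Text segments are plain strings; elements are objects.
    my $shown = ref($child) ? '<' . $child->tag . '>' : qq{"$child"};
    print "$shown\n";
}

# Internal attribute names begin with an underscore, so filter on that.
my @external = grep { !/^_/ } $p->all_attr_names;
print "external attributes: @external\n";

$p->delete;
```

The grep at the end does by hand roughly what all_external_attr_names is documented to do; it is shown only to make the underscore convention concrete.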

STRUCTURE-MODIFYING METHODS

These methods are provided for modifying the content of
trees by adding or changing nodes as parents or children
of other nodes.

$h->push_content($element_or_text, ...)
Adds the specified items to the end of the content
list of the element $h. The items of content to be
added should each be either a text segment (a string),
an HTML::Element object, or an arrayref. Arrayrefs
are fed thru "$h->new_from_lol(that_arrayref)" to
convert them into elements, before being added to the
content list of $h. This means you can say
concise things like:

$body->push_content(
    ['br'],
    ['ul',
        map ['li', $_], qw(Peaches Apples Pears Mangos)
    ]
);
See "new_from_lol" method's documentation, far below,
for more explanation.
The push_content method will try to consolidate
adjacent text segments while adding to the content list.
That's to say, if $h's content_list is

('foo bar ', $some_node, 'baz!')
and you call

$h->push_content('quack?');
then the resulting content list will be this:

('foo bar ', $some_node, 'baz!quack?')
and not this:

('foo bar ', $some_node, 'baz!', 'quack?')
If that latter is what you want, you'll have to
override the feature of consolidating text by using
splice_content, as in:

$h->splice_content(scalar($h->content_list),0,'quack?');
Similarly, if you add 'Skronk' to the beginning
of the content list by calling this:

$h->unshift_content('Skronk');
then the resulting content list will be this:

('Skronkfoo bar ', $some_node, 'baz!')
and not this:

('Skronk', 'foo bar ', $some_node, 'baz!')
What you'd do to get the latter is:

$h->splice_content(0,0,'Skronk');
$h->unshift_content($element_or_text, ...)
Just like "push_content", but adds to the beginning of the $h element's content list.
The items of content to be added should each be either
a text segment (a string), an HTML::Element object, or
an arrayref (which is fed thru "new_from_lol").
The unshift_content method will try to consolidate
adjacent text segments while adding to the content
list. See above for a discussion of this.
$h->splice_content($offset, $length, $element_or_text, ...)
Detaches the elements from $h's list of content-nodes,
starting at $offset and continuing for $length items,
replacing them with the elements of the following
list, if any. Returns the elements (if any) removed
from the content-list. If $offset is negative, then
it starts that far from the end of the array, just
like Perl's normal "splice" function. If $length and
the following list are omitted, removes everything from
$offset onward.
The items of content to be added (if any) should each
be either a text segment (a string), an arrayref
(which is fed thru "new_from_lol"), or an HTML::Element
object that's not already a child of $h.
$h->detach()
This unlinks $h from its parent, by setting its
'parent' attribute to undef, and by removing it from the
content list of its parent (if it had one). The
return value is the parent that was detached from (or
undef, if $h had no parent to start with). Note that
neither $h nor its parent is explicitly destroyed.
$h->detach_content()
This unlinks all of $h's children from $h, and
returns them. Note that these are not explicitly
destroyed; for that, you can just use
$h->delete_content.
$h->replace_with( $element_or_text, ... )
This replaces $h in its parent's content list with the
nodes specified. The element $h (which by then may
have no parent) is returned. This causes a fatal
error if $h has no parent. The list of nodes to
insert may contain $h, but at most once. Aside from
that possible exception, the nodes to insert should
not already be children of $h's parent.
Also, note that this method does not destroy $h -- use
$h->replace_with(...)->delete if you need that.
$h->preinsert($element_or_text...)
Inserts the given nodes right BEFORE $h in $h's
parent's content list. This causes a fatal error if $h
has no parent. None of the given nodes should be $h
or other children of $h. Returns $h.
$h->postinsert($element_or_text...)
Inserts the given nodes right AFTER $h in $h's
parent's content list. This causes a fatal error if $h
has no parent. None of the given nodes should be $h
or other children of $h. Returns $h.
$h->replace_with_content()
This replaces $h in its parent's content list with its
own content. The element $h (which by then has no
parent or content of its own) is returned. This
causes a fatal error if $h has no parent. Also, note
that this does not destroy $h -- use
$h->replace_with_content->delete if you need that.
$h->delete_content()
Clears the content of $h, calling $i->delete for each
content element. Compare with $h->detach_content.
Returns $h.
$h->delete()
Detaches this element from its parent (if it has one)
and explicitly destroys the element and all its
descendants. The return value is undef.
Perl uses garbage collection based on reference
counting; when no references to a data structure exist,
it's implicitly destroyed -- i.e., when no value
anywhere points to a given object anymore, Perl knows it
can free up the memory that the now-unused object
occupies.
But this fails with HTML::Element trees, because a
parent element always holds references to its children,
and its child elements hold references to the
parent, so no element ever looks like it's not in use.
So, to destroy those elements, you need to call
$h->delete on the parent.
$h->clone()
Returns a copy of the element (whose children are
clones (recursively) of the original's children, if
any).
The returned element is parentless. Any '_pos'
attributes present in the source element/tree will be
absent in the copy. For that and other reasons, the
clone of an HTML::TreeBuilder object that's in
mid-parse (i.e., the head of a tree that HTML::TreeBuilder
is elaborating) cannot (currently) be used to continue
the parse.
You are free to clone HTML::TreeBuilder trees, just as
long as: 1) they're done being parsed, or 2) you don't
expect to resume parsing into the clone. (You can
continue parsing into the original; it is never
affected.)
HTML::Element->clone_list(...nodes...)
or: ref($h)->clone_list(...nodes...)
Returns a list consisting of a copy of each node
given. Text segments are simply copied; elements are
cloned by calling $it->clone on each of them.
$h->normalize_content
Normalizes the content of $h -- i.e., concatenates any
adjacent text nodes. (Any undefined text segments are
turned into empty-strings.) Note that this does not
recurse into $h's descendants.
$h->delete_ignorable_whitespace()
This traverses under $h and deletes any text segments
that are ignorable whitespace. You should not use
this if $h is under a 'pre' element.
$h->insert_element($element, $implicit)
Inserts (via push_content) a new element under the
element at $h->pos(). Then updates $h->pos() to point
to the inserted element, unless $element is a
prototypically empty element like "br", "hr", "img", etc.
The new $h->pos() is returned. This method is useful
only if your particular tree task involves setting
$h->pos.
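
The text-consolidation behavior described under push_content is easiest to see by counting content items. The following is a minimal sketch using only methods documented in this section; the 'honk' string and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::Element;

my $h    = HTML::Element->new('p');
my $node = HTML::Element->new('br');
$h->push_content('foo bar ', $node, 'baz!');

# push_content merges new text into a trailing text segment.
$h->push_content('quack?');
my $after_push = $h->content_list;       # count, in scalar context
print "$after_push\n";                   # 3

# splice_content at the end keeps the new text as its own segment.
$h->splice_content(scalar($h->content_list), 0, 'honk');
my $after_splice = $h->content_list;
print "$after_splice\n";                 # 4

# normalize_content concatenates adjacent text segments again.
$h->normalize_content;
my $after_normalize = $h->content_list;
print "$after_normalize\n";              # 3

# detach unlinks the element and returns its former parent.
my $former_parent = $node->detach;
my $parent_tag    = $former_parent->tag;
my $is_orphan     = !defined $node->parent;
print "$parent_tag\n";                   # p

# Neither node was destroyed by detach, so clean up explicitly.
$h->delete;
$node->delete;
```

Whether you want merged or separate text segments depends on the task; code that later edits segments via content_refs_list may find the separate form more convenient.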

DUMPING METHODS

$h->dump()
$h->dump(*FH) ; # or *FH{IO} or $fh_obj
Prints the element and all its children to STDOUT (or
to a specified filehandle), in a format useful only
for debugging. The structure of the document is shown
by indentation (no end tags).
$h->as_HTML() or $h->as_HTML($entities)
or $h->as_HTML($entities, $indent_char)
or $h->as_HTML($entities, $indent_char, \%optional_end_tags)
Returns a string representing in HTML the element and
its descendants. The optional argument $entities
specifies a string of the entities to encode. For
compatibility with previous versions, specify '<>&'
here. If omitted or undef, all unsafe characters are
encoded as HTML entities. See HTML::Entities for
details.
If $indent_char is specified and defined, the HTML to
be output is indented, using the string you specify
(which you probably should set to "\t", or some number
of spaces, if you specify it). This feature is
currently somewhat experimental. But try it, and feel
free to email me any bug reports. (Note that output,
although indented, is not wrapped. Patches welcome.)
If "\%optional_end_tags" is specified and defined, it
should be a reference to a hash that holds a true
value for every tag name whose end tag is optional.
Defaults to "\%HTML::Element::optionalEndTag", which
is an alias to %HTML::Tagset::optionalEndTag, which,
at time of writing, contains true values for "p, li,
dt, dd". A useful value to pass is an empty hashref,
"{}", which means that no end-tags are optional for
this dump. Otherwise, possibly consider copying
%HTML::Tagset::optionalEndTag to a hash of your own,
adding or deleting values as you like, and passing a
reference to that hash.
$h->as_text()
$h->as_text(skip_dels => 1)
Returns a string consisting of only the text parts of
the element's descendants.
Text under 'script' or 'style' elements is never
included in what's returned. If "skip_dels" is true,
then text content under "del" nodes is not included in
what's returned.
$h->as_trimmed_text(...)
This is just like as_text(...) except that leading and
trailing whitespace is deleted, and any internal
whitespace is collapsed.
$h->as_XML()
Returns a string representing in XML the element and
its descendants. The optional argument $entities
specifies a string of the entities to encode. If
omitted or undef, all unsafe characters are encoded as
numeric XML entities.
This method is experimental. If you use this method,
tell me what you use it for, and if you run into any
problems.
The XML is not indented.
$h->as_Lisp_form()
Returns a string representing the element and its
descendants as a Lisp form. Unsafe characters are
encoded as octal escapes.
The Lisp form is indented, and contains external
("href", etc.) as well as internal attributes
("_tag", "_content", "_implicit", etc.), except for
"_parent", which is omitted.
Current example output for a given element:

("_tag" "img" "border" "0" "src" "pie.png" "usemap"
"#main.map")
This method is experimental. If you use this method,
tell me what you use it for, and if you run into any
problems.
$h->starttag() or $h->starttag($entities)
Returns a string representing the complete start tag
for the element. I.e., leading "<", tag name,
attributes, and trailing ">". Attribute values that
don't consist entirely of digits are surrounded with
double-quotes, and appropriate characters are encoded.
If $entities is omitted or undef, all unsafe characters
are encoded as HTML entities. See HTML::Entities
for details. If you specify some value for $entities,
remember to include the double-quote character in it.
(Previous versions of this module would basically
behave as if '&">' were specified for $entities.)
$h->endtag()
Returns a string representing the complete end tag for
this element. I.e., "</", tag name, and ">".
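
For comparison, here is a short sketch that runs several of the dumping methods on one small tree. The tree and its text are invented for the example, and it assumes new_from_lol accepts the nested-arrayref form; the exact layout of the indented as_HTML output is deliberately not shown, since that feature is described above as experimental.

```perl
use strict;
use warnings;
use HTML::Element;

my $div = HTML::Element->new_from_lol(
    ['div', { id => 'x' },
        ['p', '  Fish & chips  ', ['b', 'cost < 5']],
    ]
);

my $start   = $div->starttag;            # <div id="x">
my $end     = $div->endtag;              # </div>
my $text    = $div->as_text;             # text only, whitespace untouched
my $trimmed = $div->as_trimmed_text;     # Fish & chips cost < 5

# Encode only <, > and &; no indenting; no end tags treated as optional.
my $html = $div->as_HTML('<>&', undef, {});

# Same, but indented with two spaces per level.
my $indented = $div->as_HTML('<>&', '  ', {});

print "$start ... $end\n";
print "[$text]\n";
print "[$trimmed]\n";
print "$html\n";
print $indented;

$div->delete;
```

Passing the empty hashref as the third argument is the "no end-tags are optional" case mentioned above, so the closing p tag is written out.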

SECONDARY STRUCTURAL METHODS

These methods all involve some structural aspect of the
tree; either they report some aspect of the tree's
structure, or they involve traversal down the tree, or
walking up the tree.

$h->is_inside('tag', ...) or $h->is_inside($element, ...)
Returns true if the $h element is, or is contained
anywhere inside an element that is any of the ones
listed, or whose tag name is any of the tag names
listed.
$h->is_empty()
Returns true if $h has no content, i.e., has no
elements or text segments under it. In other words, this
returns true if $h is a leaf node, AKA a terminal
node. Do not confuse this sense of "empty" with
another sense that it can have in SGML/HTML/XML
terminology, which means that the element in question is of
the type (like HTML's "hr", "br", "img", etc.) that
can't have any content.
That is, a particular "p" element may happen to have
no content, so $that_p_element->is_empty will be true
-- even though the prototypical "p" element isn't
"empty" (not in the way that the prototypical "hr"
element is).
If you think this might make for potentially confusing
code, consider simply using the clearer exact
equivalent: not($h->content_list)
$h->pindex()
Return the index of the element in its parent's
contents array, such that $h would equal

$h->parent->content->[$h->pindex]
or
($h->parent->content_list)[$h->pindex]
assuming $h isn't root. If the element $h is root,
then $h->pindex returns undef.
$h->left()
In scalar context: returns the node that's the
immediate left sibling of $h. If $h is the leftmost (or
only) child of its parent (or has no parent), then
this returns undef.
In list context: returns all the nodes that're the
left siblings of $h (starting with the leftmost). If
$h is the leftmost (or only) child of its parent (or
has no parent), then this returns empty-list.
(See also $h->preinsert(LIST).)
$h->right()
In scalar context: returns the node that's the
immediate right sibling of $h. If $h is the rightmost (or
only) child of its parent (or has no parent), then
this returns undef.
In list context: returns all the nodes that're the
right siblings of $h, starting with the leftmost. If
$h is the rightmost (or only) child of its parent (or
has no parent), then this returns empty-list.
(See also $h->postinsert(LIST).)
$h->address()
Returns a string representing the location of this
node in the tree. The address consists of numbers
joined by a '.', starting with '0', and followed by
the pindexes of the nodes in the tree that are
ancestors of $h, starting from the top.
So if the way to get to a node starting at the root is
to go to child 2 of the root, then child 10 of that,
and then child 0 of that, and then you're there --
then that node's address is "0.2.10.0".
As a bit of a special case, the address of the root is
simply "0".
I foresee this being used mainly for debugging, but you
may find your own uses for it.
$h->address($address)
This returns the node (whether element or
text-segment) at the given address in the tree that $h is a
part of. (That is, the address is resolved starting
from $h->root.)
If there is no node at the given address, this returns
undef.
You can specify "relative addressing" (i.e., that
indexing is supposed to start from $h and not from
$h->root) by having the address start with a period --
e.g., $h->address(".3.2") will look at child 3 of $h,
and child 2 of that.
$h->depth()
Returns a number expressing $h's depth within its
tree, i.e., how many steps away it is from the root.
If $h has no parent (i.e., is root), its depth is 0.
$h->root()
Returns the element that's the top of $h's tree. If
$h is root, this just returns $h. (If you want to
test whether $h is the root, instead of asking what
its root is, just test not($h->parent).)
$h->lineage()
Returns the list of $h's ancestors, starting with its
parent, and then that parent's parent, and so on, up
to the root. If $h is root, this returns an empty
list.
If you simply want a count of the number of elements
in $h's lineage, use $h->depth.
$h->lineage_tag_names()
Returns the list of the tag names of $h's ancestors,
starting with its parent, and that parent's parent,
and so on, up to the root. If $h is root, this
returns an empty list. Example output: "('em', 'td',
'tr', 'table', 'body', 'html')"
$h->descendants()
In list context, returns the list of all $h's
descendant elements, listed in pre-order (i.e., an element
appears before its content-elements). Text segments
DO NOT appear in the list. In scalar context, returns
a count of all such elements.
$h->find_by_tag_name('tag', ...)
In list context, returns a list of elements at or
under $h that have any of the specified tag names. In
scalar context, returns the first (in pre-order
traversal of the tree) such element found, or undef if
none.
$h->find_by_attribute('attribute', 'value')
In a list context, returns a list of elements at or
under $h that have the specified attribute, and have
the given value for that attribute. In a scalar
context, returns the first (in pre-order traversal of the
tree) such element found, or undef if none.
This method is deprecated in favor of the more
expressive "look_down" method, which new code should use
instead.
$h->look_down( ...criteria... )
This starts at $h and looks thru its element
descendants (in pre-order), looking for elements matching
the criteria you specify. In list context, returns
all elements that match all the given criteria; in
scalar context, returns the first such element (or
undef, if nothing matched).
There are two kinds of criteria you can specify:
(attr_name, attr_value)
This means you're looking for an element with that
value for that attribute. Example: "alt", "pix!".
Consider that you can search on internal attribute
values too: "_tag", "p".
a coderef
This means you're looking for elements where
coderef->(each_element) returns true. Example:

my @wide_pix_images
    = $h->look_down(
        "_tag", "img",
        "alt", "pix!",
        sub { $_[0]->attr('width') > 350 }
    );
Note that "(attr_name, attr_value)" criteria are
faster than coderef criteria, so should presumably be
put before them in your list of criteria. That is, in
the example above, the sub ref is called only for
elements that have already passed the criteria of having
a "_tag" attribute with value "img", and an "alt"
attribute with value "pix!". If the coderef were
first, it would be called on every element, and then
what elements pass that criterion (i.e., elements for
which the coderef returned true) would be checked for
their "_tag" and "alt" attributes.
Note that comparison of string attribute-values
against the string value in "(attr_name, attr_value)"
is case-INsensitive! A criterion of "('align',
'right')" will match an element whose "align" value is
"RIGHT", or "right" or "rIGhT", etc.
Note also that "look_down" considers "" (empty-string)
and undef to be different things, in attribute values.
So this:

$h->look_down("alt", "")
will find elements with an "alt" attribute, but where
the value for the "alt" attribute is "". But this:

$h->look_down("alt", undef)
is the same as:

$h->look_down(sub { !defined($_[0]->attr('alt')) } )
That is, it finds elements that do not have an "alt"
attribute at all (or that do have an "alt" attribute,
but with a value of undef -- which is not normally
possible).
Note that when you give several criteria, this is
taken to mean you're looking for elements that match
all your criteria, not just any of them. In other
words, there is an implicit "and", not an "or". So if
you wanted to express that you wanted to find elements
with a "name" attribute with the value "foo" or with
an "id" attribute with the value "baz", you'd have to
do it like:

@them = $h->look_down(
    sub {
        # the lcs are to fold case
        lc($_[0]->attr('name')) eq 'foo'
            or lc($_[0]->attr('id')) eq 'baz'
    }
);
Coderef criteria are more expressive than "(attr_name,
attr_value)" criteria, and all "(attr_name,
attr_value)" criteria could be expressed in terms of
coderefs. However, "(attr_name, attr_value)" criteria
are a convenient shorthand. (In fact, "look_down"
itself is basically "shorthand" too, since anything
you can do with "look_down" you could do by traversing
the tree, either with the "traverse" method or with a
routine of your own. However, "look_down" often makes
for very concise and clear code.)
$h->look_up( ...criteria... )
This is identical to $h->look_down, except that
whereas $h->look_down basically scans over the list:

($h, $h->descendants)
$h->look_up instead scans over the list

($h, $h->lineage)
So, for example, this returns all ancestors of $h
(possibly including $h itself) that are "td" elements
with an "align" attribute with a value of "right" (or
"RIGHT", etc.):

$h->look_up("_tag", "td", "align", "right");
$h->traverse(...options...)
Lengthy discussion of HTML::Element's unnecessary and
confusing "traverse" method has been moved to a
separate file: HTML::Element::traverse
$h->attr_get_i('attribute')
In list context, returns a list consisting of the
values of the given attribute for $self and for all its
ancestors starting from $self and working its way up.
Nodes with no such attribute are skipped.
("attr_get_i" stands for "attribute get, with
inheritance".) In scalar context, returns the first such
value, or undef if none.
Consider a document consisting of:

<html lang='i-klingon'>
<head><title>Pati Pata</title></head>
<body>
<h1 lang='la'>Stuff</h1>
<p lang='es-MX' align='center'>
Foo bar baz <cite>Quux</cite>.
</p>
<p>Hooboy.</p>
</body>
</html>
If $h is the "cite" element, $h->attr_get_i("lang") in
list context will return the list ('es-MX',
'i-klingon'). In scalar context, it will return the value
'es-MX'.
If you call with multiple attribute names...
$h->attr_get_i('a1', 'a2', 'a3')
...in list context, this will return a list consisting
of the values of these attributes which exist in $self
and its ancestors. In scalar context, this returns
the first value (i.e., the value of the first existing
attribute from the first element that has any of the
attributes listed). So, in the above example,

$h->attr_get_i('lang', 'align');
will return:

('es-MX', 'center', 'i-klingon') # in list context
or
'es-MX' # in scalar context.
But note that this:

$h->attr_get_i('align', 'lang');
will return:

('center', 'es-MX', 'i-klingon') # in list context
or
'center' # in scalar context.
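The calls above can be tried without a parser. This sketch rebuilds the example document with new_from_lol (an assumption made for convenience; parsing the HTML with HTML::TreeBuilder should give an equivalent tree) and then queries the "cite" element:

```perl
use strict;
use warnings;
use HTML::Element;

# The example document from above, expressed as a list of lists.
my $tree = HTML::Element->new_from_lol(
    ['html', { lang => 'i-klingon' },
        ['head', ['title', 'Pati Pata']],
        ['body',
            ['h1', { lang => 'la' }, 'Stuff'],
            ['p', { lang => 'es-MX', align => 'center' },
                'Foo bar baz ', ['cite', 'Quux'], '.',
            ],
            ['p', 'Hooboy.'],
        ],
    ]
);

my $cite = $tree->look_down('_tag', 'cite');

my @langs = $cite->attr_get_i('lang');           # list context
my $lang  = $cite->attr_get_i('lang');           # scalar context
my @both  = $cite->attr_get_i('lang', 'align');  # several names

print "langs: @langs\n";
print "lang:  $lang\n";
print "both:  @both\n";
```

Note that the "h1" element's lang value never appears, since "h1" is a sibling of the paragraph and not an ancestor of "cite".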
$h->tagname_map()
Scans across $h and all its descendants, and makes a
hash (a reference to which is returned) where each
entry consists of a key that's a tag name, and a value
that's a reference to a list of all elements that have
that tag name. I.e., this method returns:

{
# Across $h and all descendants...
'a' => [ ...list of all 'a' elements... ],
'em' => [ ...list of all 'em' elements... ],
'img' => [ ...list of all 'img' elements... ],
}
(There are entries in the hash for only those tagnames
that occur at/under $h -- so if there are no "img"
elements, there'll be no "img" entry in the hash(ref)
returned.)
Example usage:

my $map_r = $h->tagname_map();
my @heading_tags = sort grep m/^h\d$/s, keys %$map_r;
if(@heading_tags) {
  print "Heading levels used: @heading_tags\n";
} else {
  print "No headings.\n";
}
$h->extract_links() or $h->extract_links(@wantedTypes)
Returns links found by traversing the element and all
of its children and looking for attributes (like
"href" in an "a" element, or "src" in an "img"
element) whose values represent links. The return value
is a reference to an array. Each element of the array
is a reference to an array with four items: the
link-value, the element that has the attribute with
that link-value, the name of that attribute, and
the tagname of that element. (Example:
"['http://www.suck.com/', $elem_obj, 'href',
'a']".) You may or may not end up using the element
itself -- for some purposes, you may use only the link
value.
You might specify that you want to extract links from
just some kinds of elements (instead of the default,
which is to extract links from all the kinds of
elements known to have attributes whose values represent
links). For instance, if you want to extract links
from only "a" and "img" elements, you could code it
like this:

for (@{ $e->extract_links('a', 'img') }) {
  my($link, $element, $attr, $tag) = @$_;
  print
    "Hey, there's a $tag that links to ",
    $link, ", in its $attr attribute, at ",
    $element->address(), ".\n";
}
$h->same_as($i)
Returns true if $h and $i are both elements representing
the same tree of elements, each with the same tag
name, with the same explicit attributes (i.e., not
counting attributes whose names start with "_"), and
with the same content (textual, comments, etc.).
Sameness of descendant elements is tested,
recursively, with "$child1->same_as($child2)", and
sameness of text segments is tested with "$segment1 eq
$segment2".
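A small sketch of what counts as "the same", using three hypothetical paragraphs built separately. The first two differ only in an internal ("_"-prefixed) attribute; the third differs in one text segment:

```perl
use strict;
use warnings;
use HTML::Element;

# Two separately built trees with identical tags, attributes and content.
my $one = HTML::Element->new_from_lol(
    ['p', { class => 'note' }, 'Hello ', ['em', 'world']] );
my $two = HTML::Element->new_from_lol(
    ['p', { class => 'note' }, 'Hello ', ['em', 'world']] );

# Internal attributes (names starting with "_") are not compared.
$two->attr('_implicit', 1);

# Same shape, but one text segment differs in case.
my $three = HTML::Element->new_from_lol(
    ['p', { class => 'note' }, 'Hello ', ['em', 'World']] );

print "one vs two:   ", ($one->same_as($two)   ? "same" : "different"), "\n";
print "one vs three: ", ($one->same_as($three) ? "same" : "different"), "\n";
```

Since text segments are compared with "eq", the comparison is case-sensitive and whitespace-sensitive.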
$h = HTML::Element->new_from_lol(ARRAYREF)
Recursively constructs a tree of nodes, based on the
(non-cyclic) data structure represented by ARRAYREF,
where that is a reference to an array of arrays (of
arrays (of arrays (etc.))).
In each arrayref in that structure, different kinds of
values are treated as follows:
· Arrayrefs
Arrayrefs are considered to designate a sub-tree
representing children for the node constructed
from the current arrayref.
· Hashrefs
Hashrefs are considered to contain attribute-value
pairs to add to the element to be constructed from
the current arrayref
· Text segments
Text segments at the start of any arrayref will be
considered to specify the name of the element to
be constructed from the current arrayref; all
other text segments will be considered to specify
text segments as children for the current
arrayref.
· Elements
Existing element objects are either inserted into
the treelet constructed, or clones of them are.
That is, when the lol-tree is being traversed and
elements constructed based on what's in it, if an
existing element object is found, then if it has no
parent, it is added directly to the treelet
constructed; but if it has a parent, then
"$that_node->clone" is added to the treelet at the
appropriate place.
An example will hopefully make this more obvious:

my $h = HTML::Element->new_from_lol(
  ['html',
    ['head',
      [ 'title', 'I like stuff!' ],
    ],
    ['body',
      {'lang', 'en-JP', _implicit => 1},
      'stuff',
      ['p', 'um, p < 4!', {'class' => 'par123'}],
      ['div', {foo => 'bar'}, '123'],
    ]
  ]
);
$h->dump;
Will print this:

<html> @0
  <head> @0.0
    <title> @0.0.0
      "I like stuff!"
  <body lang="en-JP"> @0.1 (IMPLICIT)
    "stuff"
    <p class="par123"> @0.1.1
      "um, p < 4!"
    <div foo="bar"> @0.1.2
      "123"
And printing $h->as_HTML will give something like:

<html><head><title>I like stuff!</title></head>
<body lang="en-JP">stuff<p class="par123">um, p &lt; 4!
<div foo="bar">123</div></body></html>
You can even do fancy things with "map":

$body->push_content(
  # push_content implicitly calls new_from_lol on arrayrefs...
  ['br'],
  ['blockquote',
    ['h2', 'Pictures!'],
    map ['p', $_],
      $body2->look_down("_tag", "img"),
      # images, to be copied from that other tree.
  ],
  # and more stuff:
  ['ul',
    map ['li', ['a', {'href'=>"$_.png"}, $_ ] ],
      qw(Peaches Apples Pears Mangos)
  ],
);
@elements = HTML::Element->new_from_lol(ARRAYREFS)
Constructs several elements, by calling new_from_lol for every arrayref in the ARRAYREFS list.

@elements = HTML::Element->new_from_lol(
['hr'],
['p', 'And there, on the door, was a hook!'],
);
# constructs two elements.
$h->objectify_text()
This turns any text nodes under $h from mere text
segments (strings) into real objects, pseudo-elements
with a tag-name of "~text", and the actual text
content in an attribute called "text". (For a discussion
of pseudo-elements, see the "tag" method, far above.)
This method is provided because, for some purposes, it
is convenient or necessary to be able, for a given
text node, to ask what element is its parent; and
clearly this is not possible if a node is just a text
string.
Note that these "~text" objects are not recognized as
text nodes by methods like as_text. Presumably you
will want to call $h->objectify_text, perform whatever
task that you needed that for, and then call
$h->deobjectify_text before calling anything like
$h->as_text.
This method is experimental, and you are encouraged to
report any problems you encounter with it.
$h->deobjectify_text()
This undoes the effect of $h->objectify_text. That
is, it takes any "~text" pseudo-elements in the tree
at/under $h, and deletes each one, replacing each with
the content of its "text" attribute.
Note that if $h itself is a "~text" pseudo-element, it
will be destroyed -- a condition you may need to treat
specially in your calling code (since it means you
can't very well do anything with $h after that). So
that you can detect that condition, if $h is itself a
"~text" pseudo-element, then this method returns the
value of the "text" attribute, which should be a
defined value; in all other cases, it returns undef.
This method is experimental, and you are encouraged to
report any problems you encounter with it.
(This method assumes that no "~text" pseudo-element
has any children.)
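The round trip described above might look like the following sketch. The paragraph is a made-up example, and the loop simply reports the parent of each text node, which is the kind of question plain strings cannot answer:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical paragraph with text at two different depths.
my $p = HTML::Element->new_from_lol(
    ['p', 'Hello ', ['em', 'world'], '!'] );

my $before = $p->as_text;

# Turn plain strings into "~text" pseudo-elements...
$p->objectify_text;

# ...so that each text node can now be asked for its parent.
my @text_nodes = $p->look_down('_tag', '~text');
for my $t (@text_nodes) {
    print '"', $t->attr('text'), '" is inside <', $t->parent->tag, ">\n";
}

# Turn them back into plain strings before using as_text again.
$p->deobjectify_text;
my $after = $p->as_text;

print "round trip preserved the text\n" if $before eq $after;
```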
$h->number_lists()
For every UL, OL, DIR, and MENU element at/under $h,
this sets a "_bullet" attribute for every child LI
element. For LI children of an OL, the "_bullet"
attribute's value will be something like "4.", "d.",
"D.", "IV.", or "iv.", depending on the OL element's
"type" attribute. LI children of a UL, DIR, or MENU
get their "_bullet" attribute set to "*". There
should be no other LIs (i.e., except as children of
OL, UL, DIR, or MENU elements), and if there are, they
are unaffected.
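A short sketch of reading the "_bullet" values back, assuming a hypothetical tree with one ordered and one unordered list. The exact bullet text for the OL items depends on the "type" attribute, as described above, so no particular output is claimed here:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical document fragment with two kinds of lists.
my $tree = HTML::Element->new_from_lol(
    ['div',
        ['ol', ['li', 'first'], ['li', 'second']],
        ['ul', ['li', 'apple'], ['li', 'pear']],
    ]
);

$tree->number_lists;

# Print each item prefixed with the bullet that was assigned to it.
for my $li ($tree->look_down('_tag', 'li')) {
    print $li->attr('_bullet'), ' ', $li->as_text, "\n";
}

my @ol_items = $tree->look_down('_tag', 'ol')->look_down('_tag', 'li');
my @ul_items = $tree->look_down('_tag', 'ul')->look_down('_tag', 'li');
```

This is mainly useful when rendering a tree as plain text, where the list markers have to be produced by your own code.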
$h->has_insane_linkage
This method is for testing whether this element or the
elements under it have linkage attributes (_parent and
_content) whose values are deeply aberrant: if there
are undefs in a content list; if an element appears in
the content lists of more than one element; if the
_parent attribute of an element doesn't match its
actual parent; or if an element appears as its own
descendant (i.e., if there is a cyclicity in the
tree).
This returns an empty list (or false, in scalar context)
if the subtree's linkage attributes are sane; otherwise
it returns two items (or true, in scalar context): the
element where the error occurred, and a string
describing the error.
This method is provided mainly for debugging and
troubleshooting -- it should be quite impossible for
any document constructed via HTML::TreeBuilder to
parse into a non-sane tree (since it's not the content
of the tree per se that's in question, but whether the
tree in memory was properly constructed); and it
should be impossible for you to produce an insane tree
just thru reasonable use of normal documented
structure-modifying methods. But if you're constructing
your own trees, and your program is going into infinite
loops during calls to traverse() or any of the
secondary structural methods, then as part of debugging,
consider calling has_insane_linkage on the tree.
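A debugging check along these lines might look like the sketch below. The tree is a hypothetical one built with documented methods only, so the check is expected to pass:

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical hand-built tree that we want to sanity-check.
my $tree = HTML::Element->new_from_lol(
    ['ul', ['li', 'one'], ['li', 'two']] );

# In list context: () if sane, otherwise (element, description).
my ($where, $why) = $tree->has_insane_linkage;

if (defined $why) {
    warn "Bad linkage at ", $where->starttag, ": $why\n";
} else {
    print "Linkage looks sane.\n";
}
```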

BUGS

* If you want to free the memory associated with a tree
built of HTML::Element nodes, then you will have to delete
it explicitly. See the $h->delete method, above.

* There's almost nothing to stop you from making a "tree"
with cyclicities (loops) in it, which could, for example,
make the traverse method go into an infinite loop. So
don't make cyclicities! (If all you're doing is parsing
HTML files, and looking at the resulting trees, this will
never be a problem for you.)

* There's no way to represent comments or processing
directives in a tree with HTML::Elements. Not yet, at
least.

* There's (currently) nothing to stop you from using an
undefined value as a text segment. If you're running
under "perl -w", however, this may make HTML::Element's
code produce a slew of warnings.

NOTES ON SUBCLASSING

You are welcome to derive subclasses from HTML::Element,
but you should be aware that the code in HTML::Element
makes certain assumptions about elements (and I'm using
"element" to mean ONLY an object of class HTML::Element,
or of a subclass of HTML::Element):

* The value of an element's _parent attribute must either
be undef or otherwise false, or must be an element.

* The value of an element's _content attribute must either
be undef or otherwise false, or a reference to an
(unblessed) array. The array may be empty; but if it has
items, they must ALL be either mere strings (text
segments), or elements.

* The value of an element's _tag attribute should, at
least, be a string of printable characters.

Moreover, bear these rules in mind:

* Do not break encapsulation on objects. That is, access
their contents only thru $obj->attr or more specific
methods.

* You should think twice before completely overriding any
of the methods that HTML::Element provides. (Overriding
with a method that calls the superclass method is not so
bad, tho.)
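As one possible illustration of these rules, here is a sketch of a subclass that overrides a method by wrapping the superclass version instead of replacing it. The package name My::Element and the whitespace-collapsing behavior are invented for this example:

```perl
use strict;
use warnings;

package My::Element;           # hypothetical subclass name
use parent 'HTML::Element';    # also loads HTML::Element

# Override as_text, but delegate the real work to the superclass
# and only post-process its result.
sub as_text {
    my $self = shift;
    my $text = $self->SUPER::as_text(@_);
    $text =~ s/\s+/ /g;        # collapse runs of whitespace
    $text =~ s/^ | $//g;       # trim the ends
    return $text;
}

package main;

my $p = My::Element->new('p');
$p->push_content("  lots   of\n whitespace  ");
print $p->as_text, "\n";
```

The subclass touches element contents only thru documented methods, so it does not depend on how HTML::Element stores its data internally.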

SEE ALSO

HTML::Tree; HTML::TreeBuilder; HTML::AsSubs; HTML::Tagset;
and, for the morbidly curious, HTML::Element::traverse.

COPYRIGHT

Copyright 1995-1998 Gisle Aas, 1999-2001 Sean M. Burke.

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be
useful, but without any warranty; without even the implied
warranty of merchantability or fitness for a particular
purpose.

AUTHOR

Original author Gisle Aas <gisle@aas.no>; current
maintainer Sean M. Burke, <sburke@cpan.org>