Strip(3pm)
NAME
HTML::Strip - Perl extension for stripping HTML markup from text.
SYNOPSIS
  use HTML::Strip;
  my $hs = HTML::Strip->new();
  my $clean_text = $hs->parse( $raw_html );
  $hs->eof;
DESCRIPTION
This module simply strips HTML-like markup from text in a very quick
and brutal manner. It could quite easily be used to strip XML or SGML
from text as well, but removing HTML markup is a much more common
problem, hence this module lives in the HTML:: namespace.
It is written in XS, and is thus about five times quicker than using
regular expressions for the same task.
It does not do any syntax checking (if you want that, use
HTML::Parser); instead, it merely applies the following rules:
1. Anything that looks like a tag, or group of tags, will be replaced
   with a single space character. Tags are considered to be anything
   that starts with a "<" and ends with a ">", with the caveat that a
   ">" character may appear in either of the following without ending
   the tag:
   Quote
       Quotes are considered to start with either a "'" or a """
       character, and end with a matching character not preceded by an
       odd number of escaping backslashes (i.e. "\"" does not end the
       quote but "\\\\"" does).
   Comment
       If the tag starts with an exclamation mark, it is assumed to be a
       declaration or a comment. Within such tags, ">" characters do not
       end the tag if they appear within pairs of double dashes
       (e.g. "<!-- <a href="old.htm">old page</a> -->" would be
       stripped completely).
2. Anything that appears within so-called strip tags is stripped as
   well. By default, these tags are "title", "script", "style" and
   "applet".
HTML::Strip maintains state between calls, so you can parse a document
in chunks should you wish. If one chunk ends half-way through a tag,
quote, comment, or whatever, it will remember this and expect the next
call to parse to start with the remains of said tag.
If this is not going to be the case, be sure to call $hs->eof() between
calls to $hs->parse().
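As a minimal sketch of the chunked parsing described above, the following feeds a document to parse() in two pieces, with the split falling inside a quoted attribute of a tag, and then calls eof() before starting an unrelated document. The sample strings and variable names are illustrative only, and the exact whitespace in the result may differ between releases, so no output is shown.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Illustrative input: the first chunk ends inside a quoted attribute,
# so the parser has to carry its state over to the second call.
my @chunks = ( 'Hello <b class="x', '">world</b>!' );

my $text = '';
$text .= $hs->parse($_) for @chunks;

# The next input is unrelated to the previous one, so reset the state.
$hs->eof;

my $other = $hs->parse('<p>Another document</p>');
$hs->eof;

print "$text\n$other\n";
```

Calling eof() only between unrelated documents, and not between chunks of the same one, is what lets the half-finished tag in the first chunk be completed by the second.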
METHODS
new()
    Constructor. Can optionally take a hash of settings (with keys
    corresponding to the "set_" methods below).
    For example, the following is a valid constructor:
        my $hs = HTML::Strip->new(
            striptags   => [ 'script', 'iframe' ],
            emit_spaces => 0,
        );
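Since the constructor keys correspond to the "set_" methods, the same settings can presumably be applied after construction as well. A minimal sketch, assuming the set_striptags() and set_emit_spaces() methods documented below; the sample markup is illustrative only:

```perl
use strict;
use warnings;
use HTML::Strip;

# Start from the defaults, then adjust the settings through the setters.
my $hs = HTML::Strip->new();
$hs->set_striptags( [ 'script', 'iframe' ] );   # replaces the default strip tags
$hs->set_emit_spaces(0);                        # do not turn tags into spaces

# Illustrative input: the iframe content should be dropped along with its tags.
my $text = $hs->parse('<p>kept</p><iframe>dropped</iframe>');
$hs->eof;

print "$text\n";
```

Note that set_striptags() replaces the whole set, so "title", "style" and "applet" are no longer strip tags after this call.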
parse()
    Takes a string as an argument, returns it stripped of HTML.
eof()
    Resets the current state information, ready to parse a new block of
    HTML.
clear_striptags()
    Clears the current set of strip tags.
add_striptag()
    Adds the string passed as an argument to the current set of strip
    tags.
set_striptags()
    Takes a reference to an array of strings, which replace the current
    set of strip tags.
set_emit_spaces()
    Takes a boolean value. If set to false, HTML::Strip will not
    attempt any conversion of tags into spaces. Set to true by default.
set_decode_entities()
    Takes a boolean value. If set to false, HTML::Strip will not decode
    HTML entities. Set to true by default.
LIMITATIONS
Whitespace
    Despite only outputting one space character per group of tags, and
    avoiding doing so when tags are bordered by spaces or the start or
    end of strings, HTML::Strip can often output more than desired,
    such as with the following HTML:
        <h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>
    This gives the following output:
        " HTML::Strip fast, and brutal "
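One way to tidy output like this is to squeeze runs of spaces and trim the ends. A minimal sketch in plain Perl, using a hard-coded sample string as a stand-in for parser output (it was not produced by running the module); the trimming step is an optional extra:

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the output of HTML::Strip.
my $text = '  HTML::Strip    fast, and brutal   ';

$text =~ tr/ / /s;        # squeeze each run of spaces down to a single space
$text =~ s/^ +| +$//g;    # optionally trim leading and trailing spaces

print "[$text]\n";        # prints "[HTML::Strip fast, and brutal]"
```

The tr/ / /s form only touches literal spaces; tabs and newlines are left alone, which may or may not be what you want for a given document.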
    Thus, you may want to post-filter the output of HTML::Strip to
    remove excess whitespace (for example, using "tr/ / /s;"). This
    has been improved since previous releases, but is still an issue.
HTML Entities
    HTML::Strip will only attempt decoding of HTML entities if
    HTML::Entities is installed.
EXPORT
None by default.
AUTHOR
Alex Bowley <kilinrax@cpan.org>
SEE ALSO
perl, HTML::Parser, HTML::Entities