lcwa(1)

NAME

LCWA - Last Changes Web Agent

SYNOPSIS

lcwa c[rawl] [-D file] [-c num] [-a type] [-d spec]
[-r pattern] [-i pattern] [-e pattern] [-b file]
[-s file] [-l file] [-v level] [url ...]
lcwa q[uery] [-D file] [-f format] [timespec ...]
lcwa [-V]

DESCRIPTION

Lcwa is a web agent which determines the last-change times
of documents in an intranet's web cluster by quickly
crawling its web areas via HTTP. It was written with speed
in mind, so it uses a configurable number of pre-forked
crawling clients which work in parallel. They are
coordinated by a server which implements a shared URL
stack and a common result pool.

Each client first pops a URL off the stack, retrieves the
document, determines its last-change time and then sends
this information back to the result pool. Additionally, if
the currently fetched document is of MIME type text/html,
it parses out all anchor, image and frameset hyperlinks it
contains and pushes these back onto the shared URL stack,
so they can be processed later by itself or by the other
clients running in parallel.

The HTTP crawling is done in an optimized way: first, the
request method (GET or HEAD) is determined from the URL
and the document is fetched. If the method was HEAD and
the resulting MIME type is text/html, the document is
requested again with method GET. If the modification time
still cannot be determined but the server is an Apache
one, a third request is made to retrieve the information
via a possibly installed mod_peephole module, provided
option -p is given.
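
For instance, a minimal session could look like this (the
host name and the database file name are placeholders; the
query invocation assumes that query mode reads the same
database file given with -D):

$ lcwa c -D web.times.db http://sww.example.com/
$ lcwa q -D web.times.db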

OPTIONS

-D file
Sets the database file where the crawled information is
stored. Its format is one entry per line, consisting of a
URL followed by whitespace followed by the corresponding
timestamp (or an error code when the timestamp couldn't be
determined).
-c num
Starts num crawling clients, which determine the
last-change times via HTTP crawling; the default is num=2.
Specifying more clients makes the crawling faster, but
because they are created via pre-forking this also
consumes more memory: expect about 3 MB of RAM per client.
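
For instance (placeholder host and database file name), a
crawl with eight parallel clients would, by the rough
per-client figure above, need on the order of
8 x 3 MB = 24 MB of RAM for the clients alone:

$ lcwa c -c 8 -D web.times.db http://sww.example.com/
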
-a type
Sets the crawling algorithm: type=d selects depth-first
crawling, while type=w selects width-first (breadth-first)
crawling, which is the default.
-d spec
Sets the URL path depth restriction, i.e. when the path
depth does not fall into spec, the URL is not crawled. By
default there is no such restriction. These are the
accepted syntax variants for spec:

spec    depth
        min    max
----    -----  -----
N       N      N
>N      N+1    oo
<N      1      N-1
N-M     N      M
+N      X      X+N
-N      X      X-N
To make clear what the depth means, here is an example:
the URL http://foo.bar.com/any/url/to/data?some&query is
of depth 4, while
http://foo.bar.com/any/url/to/data/?some&query is of depth
5. The rule is this: for the depth only the path part of a
URL counts, and each slash increases the depth, starting
with depth=1 for the root path /.
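
For example (placeholder host, database file and depth
value), the following hypothetical invocation crawls
depth-first and only accepts URLs with a path depth of at
most 3; note that spec values containing < or > have to be
quoted so the shell does not interpret them as
redirections:

$ lcwa c -a d -d '<4' -D web.times.db http://sww.example.com/
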
-r pattern
Restricts the crawled URLs by a regular expression
pattern, i.e. only URLs which match ALL of these patterns
are pushed back onto the shared URL stack for further
processing.
-i pattern
Restricts the crawled URLs by a regular expression
pattern, i.e. URLs which match AT LEAST ONE of these
patterns are forced to be pushed back onto the shared URL
stack for further processing. If NONE matches, the URL is
not accepted.
-e pattern
Restricts the crawled URLs by a regular expression
pattern, i.e. only URLs which match NONE of these patterns
are pushed back onto the shared URL stack for further
processing. If ANY pattern matches, the URL is not
accepted.
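
The three pattern options can be combined. For example
(placeholder host and patterns), the following
hypothetical invocation restricts the crawl to a single
server, includes HTML pages and directory indexes, and
excludes everything under /cgi-bin/:

$ lcwa c -r '^http://sww\.example\.com/' \
         -i '\.html$' -i '/$' \
         -e '/cgi-bin/' \
         -D web.times.db http://sww.example.com/
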
-b file
Sets the filename of the temporarily used URL
brainfile which is needed while crawling to determine
which URLs have already been seen.
-s file
Sets the filename of the temporarily used stack
swapfile which is needed while crawling to swap out
the in-core stack in order to avoid excessive memory
consumption.
-l file
Sets the filename of the logfile which records processing
information about the crawling process. Only interesting
for debugging.
-v level
Sets the verbose mode to level, so that some processing
information is printed on the console.
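
For example (all file names and the verbosity level are
placeholders), the temporary files and the logfile can be
placed explicitly and verbose output enabled like this:

$ lcwa c -b /tmp/lcwa.brain -s /tmp/lcwa.stack \
         -l /tmp/lcwa.log -v 1 \
         -D web.times.db http://sww.example.com/
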
-V Displays the version string.

EXAMPLES

$ lcwa -c40 -r '^http://[a-zA-Z0-9._]+.sdm.de/' \
       -i '.*.p?html$' -i '.*/$' \
       -d 4 -o sww.times.db \
       http://sww.sdm.de/
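
The collected timestamps can afterwards be read back in
query mode. A hypothetical follow-up invocation (assuming,
as described under OPTIONS, that the database file is
selected with -D) would be:

$ lcwa q -D sww.times.db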

AUTHOR

Ralf S. Engelschall
rse@engelschall.com
www.engelschall.com