w3mir(1)
NAME
w3mir - all purpose HTTP-copying and mirroring tool
SYNOPSIS
w3mir [options] [HTTP-URL]
w3mir -B [options] <HTTP-URLS>
DESCRIPTION
w3mir is an all purpose HTTP copying and mirroring tool.
The main focus of w3mir is to create and maintain a
browseable copy of one, or several, remote WWW site(s).
Used to the max, w3mir can retrieve the contents of several
related sites and leave the mirror browseable via a local
web server, or from a filesystem, such as directly from a
CDROM.

w3mir has options for all operations that are simple enough
for options. For authentication and passwords, multiple
site retrievals and such you will have to resort to a
"CONFIGURATION-FILE". If browsing from a filesystem,
references ending in '/' need to be rewritten to end in
'/index.html', and in any case URLs that are redirected
will need to be changed to make the mirror browseable; see
the documentation of Fixup in the "CONFIGURATION-FILE"
section.

w3mir's default behavior is to do as little as possible and
to be as nice as possible to the server(s) it is getting
documents from. You will need to read through the options
list to make w3mir do more complex, and useful, things.
Most of the things w3mir can do are also documented in the
w3mir-HOWTO, which is available at the w3mir home page
(http://www.math.uio.no/~janl/w3mir/) as well as in the
w3mir distribution bundle.

You may specify many options and one HTTP-URL on the w3mir
command line.

A single HTTP URL must be specified either on the command
line or in a URL directive in a configuration file. If
the URL refers to a directory it must end with a "/",
otherwise you might get surprised at what gets retrieved
(e.g. rather more than you expect).

Options must be prefixed with at least one "-" as shown
below; you can use more if you want to. -cfgfile is
equivalent to --cfgfile or even ------cfgfile. Options
cannot be clustered, i.e., -r -R is not equivalent to -rR.
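
The prefix rule above can be illustrated with a small sketch. This is not w3mir's own option parser; normalize_option is a hypothetical helper that only shows why -cfgfile, --cfgfile and ------cfgfile name the same option, while -rR is a different (clustered) name and not -r plus -R.

```python
def normalize_option(arg: str) -> str:
    """Strip any number of leading dashes from an option name."""
    if not arg.startswith("-"):
        raise ValueError(f"not an option: {arg!r}")
    return arg.lstrip("-")


# All three spellings reduce to the same option name.
print(normalize_option("-cfgfile"))
print(normalize_option("--cfgfile"))
print(normalize_option("------cfgfile"))

# A clustered "-rR" stays one unknown name; it is not split into r and R.
print(normalize_option("-rR"))
```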

-h | -help | -?
    Prints a brief summary of all command line options and
    exits.

-cfgfile file
    Makes w3mir read the given configuration file. See the
    next section for how to write such a file.

-r
    Puts w3mir into recursive mode. The default is to fetch
    only one document and then quit. Recursive mode means
    that all the documents linked to from the given document
    are fetched, and all that they link to in turn, and so
    on, but only if they are in the same directory or under
    the same directory as the start document. Any document
    that is in or under the starting document's directory is
    said to be within the scope of retrieval.

-fa
    Fetch All. Normally w3mir will only get the document if
    it has been updated since the last time it was fetched.
    This switch turns that check off.

-fs
    Fetch Some. Not the opposite of -fa, but rather, fetch
    the ones we don't have already. This is handy to restart
    copying of a site incompletely copied by earlier,
    interrupted, runs of w3mir.

-p n
    Pause for n seconds between getting each document. The
    default is 30 seconds.

-rp n
    Retry Pause, in seconds. When w3mir fails to get a
    document for some technical reason (timeout mainly) the
    document will be queued for a later retry. The retry
    pause is how long w3mir waits after finishing a mirror
    pass before starting a new one to get the still missing
    documents. This should be a long time, so network
    conditions have a chance to get better. The default is
    600 seconds (10 minutes), which might be a bit too short;
    for running w3mir in batch I would suggest an hour (3600
    seconds) or more.

-t n
    Number of reTries. If w3mir cannot get all the documents
    by the nth retry w3mir gives up. The default is 3.

-drr
    Disable Robot Rules. The robot exclusion standard is
    described in
    http://info.webcrawler.com/mak/projects/robots/norobots.html.
    By default w3mir honors this standard. This option causes
    w3mir to ignore it.

-nnc
    No Newline Conversion. Normally w3mir converts the
    newline format of all files that the web server says are
    text files. However, not all web servers are reliable,
    and so binary files may become corrupted due to the
    newline conversion w3mir performs. Use this option to
    stop w3mir from converting newlines. This also causes
    the file to be regarded as binary when written to disk,
    to disable the implicit newline conversion when saving
    text files on most non-Unix systems.

    This will probably be on by default in version 1.1 of
    w3mir, but not in version 1.0.

-R
    Remove files. Normally w3mir will not remove files that
    are no longer on the server/part of the retrieved web of
    files. When this option is specified all files no longer
    needed or found on the servers will be removed. If w3mir
    fails to get a document for any other reason the file
    will not be removed.

-B
    Batch fetch documents whose URLs are given on the command
    line.

    In combination with the -r and/or -l switch all HTML and
    PDF documents will be mined for URLs, but the documents
    will be saved on disk unchanged. When used with the -r
    switch only one single URL is allowed. When not used
    with the -r switch no HTML/URL processing will be
    performed at all.

    When the -B switch is used with -r, w3mir will not do
    repeated mirrorings reliably, since the changes w3mir
    needs to make in the documents to work reliably are not
    made. In any case it's best not to use -R in combination
    with -B, since that can result in deleting rather more
    documents than expected. However, if the person writing
    the documents being copied is good about making
    references relative and placing the <HTML> tag at the
    beginning of documents, there is a fair chance that
    things will work even so. But I wouldn't bet on it. It
    will, however, work reliably for repeated mirroring if
    the -r switch is not used.

    When the -B switch is specified, redirects for a given
    document will be followed no matter where they point.
    The redirected-to document will be retrieved in the place
    of the original document. This is a potential weakness,
    since w3mir can be directed to fetch any document
    anywhere on the web.

    Unless used with -r, all retrieved files will be stored
    in one directory using the remote filename as the local
    filename. I.e., http://foo/bar/gazonk.html will be saved
    as gazonk.html. http://foo/bar/ will be saved as
    bar-index.html so as to avoid name collisions for the
    common case of URLs ending in /.

-I
    This switch can only be used with the -B switch, and
    only after it on the command line or configuration file.
    When given, w3mir will get URLs from standard input
    (i.e., w3mir can be used as the end of a pipe that
    produces URLs). There should only be one URL per line of
    input.

-q
    Quiet. Turns off all informational messages, only errors
    will be output.

-c
    Chatty. w3mir will output more progress information.
    This can be used if you're watching w3mir work.

-v
    Version. Output w3mir's version.

-s
    Copy the given document(s) to STDOUT.

-f
    Forget. The retrieved documents are not saved on disk,
    they are just forgotten. This can be used to prime the
    cache in proxy servers, or to avoid saving documents
    whose URLs you just want to list (see -l).

-l
    List the URLs referred to in the retrieved document(s) on
    STDOUT.

-umask n
    Sets the umask, which determines the permission bits of
    all retrieved files. The number is taken as octal unless
    it starts with 0x, in which case it's taken as
    hexadecimal. No matter what you set this to, make sure
    you get write as well as read access to created files and
    directories.

    Typical values are:

    022  let everyone read the files (and directories), only
         you can change them.
    027  you and everyone in the same file-group as you can
         read, only you can change them.
    077  only you can read the files, only you can change
         them.
    0    everyone can read, write and change everything.
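
    As a rough illustration of the values above, the sketch below parses a umask argument the way the text describes (octal unless it starts with 0x) and shows the resulting permission bits. It assumes the usual Unix convention that files are requested with mode 666 and directories with 777 before the umask is applied; the modes w3mir itself requests are not stated here, and parse_umask and resulting_mode are hypothetical helper names.

```python
def parse_umask(text: str) -> int:
    """Parse a umask argument: octal unless prefixed with 0x (hexadecimal)."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 8)


def resulting_mode(umask: int, requested: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested & ~umask


# Assumed request modes: 0o666 for files, 0o777 for directories.
for value in ("022", "027", "077", "0"):
    mask = parse_umask(value)
    print(value,
          "files:", oct(resulting_mode(mask, 0o666)),
          "directories:", oct(resulting_mode(mask, 0o777)))
```

    With 022, for instance, files end up readable by everyone and writable only by the owner.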

    The default is whatever was set when w3mir was invoked.
    022 is a reasonable value.

    This option has no meaning, or effect, on Win32
    platforms.

-P server:port
    Use the given server and port as an HTTP proxy server.
    If no port is given port 80 is assumed (this is the
    normal HTTP port). This is useful if you are inside a
    firewall, or use a proxy server to save bandwidth.

-pflush
    Proxy flush, force the proxy server to flush its cache
    and re-get the document from the source. The
    Pragma: no-cache HTTP/1.0 header is used to implement
    this.

-ir referrer
    Initial Referrer. Set the referrer of the first
    retrieved document. Some servers are reluctant to serve
    certain documents unless this is set right.

-agent agent
    Set the HTTP User-Agent field's value. Some servers will
    serve different documents according to the WWW browser's
    capabilities. w3mir normally has w3mir/version in this
    header field. Netscape uses things like
    Mozilla/3.01 (X11; I; Linux 2.0.30 i586) and MSIE uses
    things like Mozilla/2.0 (compatible; MSIE 3.02; Windows NT).
    Remember to enclose agent strings containing spaces in
    double quotes (").

-lc
    Lower Case URLs. Some OSes, like W95 and NT, are not
    case sensitive when it comes to filenames. Thus
    webmasters using such OSes can case filenames differently
    in different places (apps.html, Apps.html, APPS.HTML).
    If you mirror to a Unix machine this can result in one
    file on the server becoming many in the mirror. This
    option lowercases all filenames so the mirror corresponds
    better with the server.

    If given it must be the first option on the command line.

    This option does not work perfectly, most especially for
    mixed case host-names.

-d n
    Set the debug level. A debug level higher than 0 will
    produce lots of extra output for debugging purposes.

-abs
    Force all URLs to be absolute. If you retrieve
    http://www.ifi.uio.no/~janl/index.html and it references
    foo.html, the reference is made absolute as
    http://www.ifi.uio.no/~janl/foo.html. In other words,
    you get absolute references to the origin site if you
    use this option.
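
    The resolution described for -abs is ordinary relative URL resolution. A minimal sketch using Python's standard library (not w3mir's own code) reproduces the example from the text:

```python
from urllib.parse import urljoin

# The document that was retrieved, and a relative reference found inside it.
base = "http://www.ifi.uio.no/~janl/index.html"
reference = "foo.html"

# Resolve the reference against the URL of the document containing it.
absolute = urljoin(base, reference)
print(absolute)  # http://www.ifi.uio.no/~janl/foo.html
```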
CONFIGURATION-FILE
Most things can be mirrored with a (long) command line.
But multi server mirroring, authentication and some other
things are only available through a configuration file. A
configuration file can be specified with the -cfgfile
switch; w3mir also looks for .w3mirc (w3mir.ini on Win32
platforms) in the directory where w3mir is started from.

The configuration file consists of lines of comments and
directives. A directive consists of a keyword followed by
a colon (:) and then one or several arguments.

    # This is a comment. And the next line is a directive:
    Options: recurse, remove

A comment can only start at the beginning of a line. The
directive keywords are not case-sensitive, but the
arguments might be.

Options: recurse | no-date-check | only-nonexistent | list-urls | lowercase | remove | batch | input-urls | no-newline-conv | list-nonmirrored
    This must be the first directive in a configuration
    file.

    recurse
        see -r switch.
    no-date-check
        see -fa switch.
    only-nonexistent
        see -fs switch.
    list-urls
        see -l option.
    lowercase
        see -lc option.
    remove
        see -R option.
    batch
        see -B option.
    input-urls
        see -I option.
    no-newline-conv
        see -nnc option.
    list-nonmirrored
        List URLs not mirrored in a file called .notmirrored
        ('notmir' on win32). It will contain a lot of
        duplicate lines and quite possibly be quite large.

URL: HTTP-URL [target-directory]
    The URL directive may only appear once in any
    configuration file.

    Without the optional target directory argument it
    corresponds directly to the single-HTTP-URL argument on
    the command line.

    If the optional target directory is given, all documents
    from under the given URL will be stored in that
    directory, and under. The target directory is most
    likely only specified if the Also directive is also
    specified.

    If the URL given refers to a directory it must end in a
    "/", otherwise you might get quite surprised at what gets
    retrieved.

    Either one URL: directive or the single-HTTP-URL at the
    command line must be given.

Also: HTTP-URL directory
    This directive is only meaningful if the recurse (or -r)
    option is given.

    The directive enlarges the scope of a recursive retrieval
    to contain the given HTTP-URL and all documents in the
    same directory or under. Any documents retrieved because
    of this directive will be stored in the given directory
    of the mirror.

    In practice this means that if the documents to be
    retrieved are stored on several servers, or in several
    hierarchies on one server, or any combination of those,
    then the Also directive ensures that we get everything
    into one single mirror.

    This also means that if you're retrieving

        URL: http://www.foo.org/gazonk/

    but it has inline icons or images stored in
    http://www.foo.org/icons/ which you will also want to
    get, then those will be retrieved as well by entering

        Also: http://www.foo.org/icons/ icons

    As with the URL directive, if the URL refers to a
    directory it must end in a "/".

    Another use for it is when mirroring sites that have
    several names that all refer to the same (logical)
    server:

        URL: http://www.midifest.com/
        Also: http://midifest.com/ .

    At this point in time w3mir has no mechanism to easily
    enlarge the scope of a mirror after it has been
    established. That means that you should survey the
    documents you are going to retrieve to find out what
    icons, graphics and other things they refer to that you
    want, and what other sites you might like to retrieve.
    If you find out that something is missing you will have
    to delete the whole mirror, add the needed Also
    directives and then reestablish the mirror. This lack of
    flexibility in what to retrieve will be addressed at a
    later date.

    See also the Also-quene directive.
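
    One way to picture how URL and Also together define the scope of retrieval and the layout of the mirror is the sketch below. It assumes that "in the same directory or under" can be modelled as a plain URL prefix test and that the URL directive without a target directory stores into the current directory; both are simplifications for illustration, and local_path is a hypothetical helper, not part of w3mir.

```python
from typing import Optional

# (URL prefix, local directory) pairs, taken from the example above.
# The first entry plays the role of "URL:", the second of "Also:".
SCOPE = [
    ("http://www.foo.org/gazonk/", "."),
    ("http://www.foo.org/icons/", "icons"),
]


def local_path(url: str) -> Optional[str]:
    """Return the path inside the mirror, or None if the URL is out of scope."""
    for prefix, directory in SCOPE:
        if url.startswith(prefix):
            rest = url[len(prefix):]
            return rest if directory == "." else f"{directory}/{rest}"
    return None


print(local_path("http://www.foo.org/gazonk/intro.html"))  # intro.html
print(local_path("http://www.foo.org/icons/new.gif"))      # icons/new.gif
print(local_path("http://www.foo.org/other/page.html"))    # None
```

    Without the Also line the icons URL would fall outside the scope and not be retrieved at all.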

Also-quene: HTTP-URL directory
    This is like Also, except that the URL itself is also
    queued. The Also directive will not cause any documents
    to be retrieved UNLESS they are referenced by some other
    document w3mir has already retrieved.

Quene: HTTP-URL
    This queues the URL for retrieval, but does not enlarge
    the scope of the retrieval. If the URL is outside the
    scope of retrieval it will not be retrieved anyway.

    The observant reader will see that Also-quene is like
    Also combined with Quene.

Initial-referer: referer
    see -ir option.

Ignore: wildcard
Fetch: wildcard
Ignore-RE: regular-expression
Fetch-RE: regular-expression
    These four are used to set up rules about which
    documents, within the scope of retrieval, should be
    gotten and which not. The default is to get anything
    that is within the scope of retrieval. That may not be
    practical though. This goes for CGI scripts, and
    especially server side image maps and other things that
    are executed/evaluated on the server. There might be
    other things you want unfetched as well.

    w3mir stores the Ignore/Fetch rules in a list. When a
    document is considered for retrieval the URL is checked
    against the list in the same order that the rules
    appeared in the configuration file. If the URL matches
    any rule the search stops at once. If it matched an
    Ignore rule the document is not fetched and any URLs in
    other documents pointing to it will point to the document
    at the original server (not inside the mirror). If it
    matched a Fetch rule the document is gotten. If not
    matched by any rules the document is gotten.

    The wildcards are a very limited subset of Unix
    wildcards. w3mir understands only '?', '*', and '[x-y]'
    ranges.
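
    The first-match evaluation described above can be sketched as follows. This is an illustration rather than w3mir's implementation: it uses Python's fnmatch module, which accepts somewhat more syntax than the '?', '*' and '[x-y]' subset, and it assumes that a wildcard is matched against the whole URL. The rule list and should_fetch are hypothetical.

```python
from fnmatch import fnmatchcase

# Rules in configuration-file order; the first matching rule decides.
RULES = [
    ("fetch", "*/foobar.cgi"),
    ("ignore", "*.cgi"),
    ("ignore", "*.map"),
]


def should_fetch(url: str) -> bool:
    """Apply the rules in order and stop at the first match."""
    for action, pattern in RULES:
        if fnmatchcase(url, pattern):
            return action == "fetch"
    # No rule matched: everything within the scope is fetched by default.
    return True


print(should_fetch("http://www.foo.org/cgi/foobar.cgi"))  # True
print(should_fetch("http://www.foo.org/cgi/other.cgi"))   # False
print(should_fetch("http://www.foo.org/index.html"))      # True
```

    Swapping the first two rules would make the Ignore rule win for foobar.cgi as well, which is why rule order matters.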

    The perl-regular-expression is perl's superset of the
    normal Unix regular expression syntax. They must be
    completely specified, including the prefixed m, a
    delimiter of your choice (except the paired delimiters:
    parentheses, brackets and braces), and any of the RE
    modifiers. E.g.,

        Ignore-RE: m/.gif$/i

    or

        Ignore-RE: m~/.*/.*/.*/~

    and so on. "#" cannot be used as delimiter as it is the
    comment character in the configuration file. This also
    has the bad side-effect of making you unable to match
    fragment names (#foobar) directly. Fortunately perl
    allows writing ``#'' as ``\043''.

    You must be very careful when using the RE anchors (``^''
    and ``$'') with the RE versions of these and the Apply
    directive. Given the rules:

        Fetch-RE: m/foobar.cgi$/
        Ignore: *.cgi

    then all files called ``foobar.cgi'' will be fetched.
    However, if the file is referenced as
    ``foobar.cgi?query=mp3'' it will not be fetched, since the
    ``$'' anchor will prevent it from matching the Fetch-RE
    directive and then it will match the Ignore directive
    instead. If you want to match ``foobar.cgi'' but not
    ``foobar.cgifu'' you can use perl's ``\b'' assertion,
    which matches a word boundary:

        Fetch-RE: m/foobar.cgi\b/
        Ignore: *.cgi

    which will get ``foobar.cgi'' as well as
    ``foobar.cgi?query=mp3'' but not ``foobar.cgifu''. BUT,
    you must keep in mind that a lot of different characters
    make a word boundary; maybe something more subtle is
    needed.

Apply: regular-expression
    This is used to change a URL into another URL. It is a
    potentially very powerful feature, and it also provides
    ample chance for you to shoot your own foot. The whole
    apparatus is somewhat tentative; if you find there is a
    need for changes in how Apply rules work please E-mail.
    If you are going to use this feature please read the
    documentation for Fetch-RE and Ignore-RE first.

    The Apply expressions are applied, in sequence, to the
    URLs in their absolute form, i.e., with the whole
    http://host:port/dir/ec/tory/file URL. It is only after
    this that w3mir checks if a document is within the scope
    of retrieval or not. That means that Apply rules can be
    used to change certain URLs to fall inside the scope of
    retrieval, and vice versa.

    The regular-expression is perl's superset of the usual
    Unix regular expressions for substitution. As with Fetch
    and Ignore rules it must be specified fully, with the s
    and delimiting character. It has the same restrictions
    with regards to delimiters. E.g.,

        Apply: s~/foo/~/bar/~i

    to translate the path element foo to bar in all URLs.
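
    To see what such a substitution does to an absolute URL, here is a small sketch that mimics the rule above with Python's re module instead of perl. The host www.example.org is a made-up example and apply_rule is a hypothetical helper; like a perl s/// without the g modifier, only the first match is replaced.

```python
import re


def apply_rule(url: str) -> str:
    """Mimic the perl substitution s~/foo/~/bar/~i on an absolute URL."""
    return re.sub(r"/foo/", "/bar/", url, count=1, flags=re.IGNORECASE)


print(apply_rule("http://www.example.org/foo/index.html"))
# http://www.example.org/bar/index.html

# The rule sees the whole URL, so it also fires on a host named "foo".
print(apply_rule("http://foo/index.html"))
# http://bar/index.html
```

    The second call shows the kind of foot-shooting the text warns about: the pattern is not limited to the path, so anchoring or a more specific expression may be needed.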

    "#" cannot be used as delimiter as it is the comment
    character in the configuration file.

    Please note that w3mir expects that URLs identifying
    'directories' keep identifying directories after
    application of Apply rules. Ditto for files.

Agent: agent
    see -agent option.

Pause: n
    see -p option.

Retry-Pause: n
    see -rp option.

Retries: n
    see -t option.

debug: n
    see -d option.

umask: n
    see -umask option.

Robot-Rules: on | off
    Turn robot rules on or off. See -drr option.

Remove-Nomirror: on | off
    If this is enabled, sections between two consecutive

        <!--NO MIRROR-->

    comments in a mirrored document will be removed. This
    editing is performed even if batch getting is specified.

Header: html/text
    Insert this complete html/text into the start of the
    document. This will be done even if batch is specified.

File-Disposition: save | stdout | forget
    What to do with a retrieved file. The save alternative
    is the default. The two others correspond to the -s and
    -f options. Only one may be specified.

Verbosity: quiet | brief | chatty
    How much w3mir informs you of its progress. Brief is the
    default. The two others correspond to the -q and -c
    switches.

Cd: directory
    Change to the given directory before starting work. If
    it does not exist it will be quietly created. Using this
    option breaks the 'fixup' code so consider not using it,
    ever.

HTTP-Proxy: server:port
    see the -P switch.

HTTP-Proxy-user: username
HTTP-Proxy-passwd: password
    These two are used to activate authentication with the
    proxy server. w3mir only supports basic proxy
    authentication, and is quite simpleminded about it: if
    proxy authentication is on, w3mir will always give it to
    the proxy. The domain concept is not supported with
    proxy authentication.

Proxy-Options: no-pragma | revalidate | refresh | no-store
    Set proxy options. There are two ways to pass proxy
    options, HTTP/1.0 compatible and HTTP/1.1 compatible.
    Newer proxy servers will understand the 1.1 way as well
    as 1.0. With old proxy servers only the 1.0 way will
    work. w3mir will prefer the 1.0 way.

    The only 1.0 compatible proxy option is refresh; it
    corresponds to the -pflush option and forces the proxy
    server to pass the request to an upstream server to
    retrieve a fresh copy of the document.

    The no-pragma option forces w3mir to use the HTTP/1.1
    proxy control header; use this only with servers you know
    to be new, otherwise it won't work at all. Use of any
    option but refresh will also cause HTTP/1.1 to be used.

    revalidate forces the proxy server to contact the
    upstream server to validate that it has a fresh copy of
    the document. This is nicer to the net than the refresh
    option, which forces a re-get of the document no matter
    if the server has a fresh copy already.

    no-store forbids the proxy from storing the document in
    anything other than transient storage. This can be used
    when transferring sensitive documents, but is by no means
    any warranty that the document can't be found on any
    storage device on the proxy server after the transfer.
    Cryptography, if legal in your country, is the solution
    if you want the contents to be secret.

    refresh corresponds to the HTTP/1.0 header
    Pragma: no-cache or the identical HTTP/1.1 Cache-control
    option. revalidate and no-store correspond to max-age=0
    and no-store respectively.

Authorization
    w3mir supports only the basic authentication of HTTP/1.0.
    This method can assign a password to a given
    user/server/realm. The "user" is your user-name on the
    server. The "server" is the server. The realm is an
    HTTP concept. It is simply a grouping of files and
    documents. One file or a whole directory hierarchy can
    belong to a realm. One server may have many realms. A
    user may have separate passwords for each realm, or the
    same password for all the realms the user has access to.
    A combination of a server and a realm is called a domain.

    Auth-Domain: server:port/realm
        Give the server and port, and the belonging realm
        (making a domain) that the following authentication
        data holds for. You may specify the "*" wildcard for
        either of server:port and realm; this will work well
        if you only have one username and password on all the
        servers mirrored.

    Auth-User: user
        Your user-name.

    Auth-Passwd: password
        Your password.
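
    For readers unfamiliar with basic authentication, the sketch below shows how a domain lookup with "*" wildcards and the resulting HTTP/1.0 header could look. The credential table, the host www.example.org and the most-specific-first lookup order are assumptions made for illustration; only the header format (base64 of user:password) comes from the HTTP specification.

```python
import base64
from typing import Dict, Optional, Tuple

# (server:port, realm) -> (user, password); "*" acts as a wildcard.
CREDENTIALS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("www.example.org:80", "staff"): ("me", "my_password"),
    ("*", "*"): ("otherme", "otherpassword"),
}


def lookup(server: str, realm: str) -> Optional[Tuple[str, str]]:
    """Find credentials for a domain, trying the most specific key first."""
    for key in ((server, realm), (server, "*"), ("*", realm), ("*", "*")):
        if key in CREDENTIALS:
            return CREDENTIALS[key]
    return None


def basic_auth_header(user: str, password: str) -> str:
    """Build the header line used by HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Authorization: Basic " + token.decode("ascii")


user, password = lookup("www.example.org:80", "staff")
print(basic_auth_header(user, password))
```

    Note that base64 is an encoding, not encryption, so the password travels essentially in the clear.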

    These three directives may be repeated, in clusters, as
    many times as needed to give the necessary authentication
    information.

Disable-Headers: referer | user
    Stop w3mir from sending the given headers. This can be
    used for anonymity, making your retrievals harder to
    track. It will be even harder if you specify a generic
    Agent, like Netscape.

Fixup: ...
    This directive controls some aspects of the separate
    program w3mfix. w3mfix uses the same configuration file
    as w3mir since it needs a lot of the information in the
    w3mir configuration file to do its work correctly.
    w3mfix is used to make mirrors more browseable on
    filesystems (disk or CDROM), and to fix redirected URLs
    and some other URL editing. If you want a mirror to be
    browseable off disk or CDROM you almost certainly need to
    run w3mfix. In many cases it is not necessary when you
    run a mirror to be used through a WWW server.

    To make w3mir write the data files w3mfix needs, and do
    nothing else, simply put

        Fixup: on

    in the configuration file. To make w3mir run w3mfix
    automatically after each time w3mir has completed a
    mirror run, specify

        Fixup: run

    w3mfix is documented in a separate man page in an effort
    to not prolong this man page unnecessarily.

Index-name: name-of-index-file
    When retrieving URLs ending in '/' w3mir needs to append
    a filename to store it locally. The default value for
    this is 'index.html' (this is the most used, its use
    originated in the NCSA HTTPD as far as I know). Some WWW
    servers use the filename 'Welcome.html' or 'welcome.html'
    instead (this was the default in the old CERN HTTPD).
    And servers running on limited OSes frequently use
    'index.htm'. To keep things consistent and sane w3mir
    and the server should use the same name. Put

        Index-name: welcome.html

    when mirroring from a site that uses that convention.

    When doing a multiserver retrieval where the servers use
    two or more different names for this you should use Apply
    rules to make the names consistent within the mirror.

    When making a mirror for use with a WWW server, the
    mirror should use the same name as the new server for
    this; to accomplish that Index-name should be combined
    with Apply.

    Here is an example of use in the two latter cases when
    Welcome.html is the preferred index name:

        Index-name: Welcome.html
        Apply: s~/index.html$~/Welcome.html~

    Similarly, if index.html is the preferred index name:

        Apply: s~/Welcome.html~/index.html~

    Index-name is not needed since index.html is the default
    index name.
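
    The combination of Index-name and Apply can be pictured with the following sketch. It is an illustration only: local_name and unify_index are hypothetical helpers, and the regular expression is the one from the Welcome.html example above, run through Python's re module instead of perl.

```python
import re


def local_name(url_path: str, index_name: str = "index.html") -> str:
    """Append the index file name when the URL path denotes a directory."""
    return url_path + index_name if url_path.endswith("/") else url_path


def unify_index(url: str) -> str:
    """Mimic Apply: s~/index.html$~/Welcome.html~ from the example."""
    return re.sub(r"/index.html$", "/Welcome.html", url, count=1)


print(local_name("/docs/"))                  # /docs/index.html
print(local_name("/docs/", "Welcome.html"))  # /docs/Welcome.html
print(local_name("/docs/page.html"))         # /docs/page.html
print(unify_index("http://www.foo.org/docs/index.html"))
```

    With both in place, explicit references to index.html and references ending in '/' end up pointing at the same local file.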
EXAMPLES
* Just get the latest Dr-Fun if it has been changed since
  the last time:

    w3mir http://sunsite.unc.edu/Dave/Dr-Fun/latest.jpg

* Recursively fetch everything on the Star Wars site,
  remove what is no longer at the server from the mirror:

    w3mir -R -r http://www.starwars.com/

* Fetch the contents of the Sega site through a proxy,
  pausing for 30 seconds between each document:

    w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/

* Do everything according to w3mir.cfg:

    w3mir -cfgfile w3mir.cfg

* A simple configuration file:

    # Remember, options first, as many as you like, comma separated
    Options: recurse, remove
    #
    # Start here:
    URL: http://www.starwars.com/
    #
    # Speed things up
    Pause: 0
    #
    # Don't get junk
    Ignore: *.cgi
    Ignore: *-cgi
    Ignore: *.map
    #
    # Proxy:
    HTTP-Proxy: www.foo.org:4321
    #
    # You _should_ cd away from the directory where the config file is.
    cd: starwars
    #
    # Authentication:
    Auth-domain: server:port/realm
    Auth-user: me
    Auth-passwd: my_password
    #
    # You can use '*' in place of server:port and/or realm:
    Auth-domain: */*
    Auth-user: otherme
    Auth-passwd: otherpassword

Also:

    # Retrieve all of janl's home pages:
    Options: recurse
    #
    # This is the two argument form of URL:. It fetches the first into the second
    URL: http://www.math.uio.no/~janl/ math/janl
    #
    # These say that any documents referred to that live under these places
    # should be gotten too. Into the named directories. Two arguments are
    # required for 'Also:'.
    Also: http://www.math.uio.no/drift/personer/ math/drift
    Also: http://www.ifi.uio.no/~janl/ ifi/janl
    Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
    #
    # The options above will result in this directory hierarchy under
    # where you started w3mir:
    # w3mir/math/janl         files from http://www.math.uio.no/~janl
    # w3mir/math/drift        from http://www.math.uio.no/drift/personer/
    # w3mir/ifi/janl          from http://www.ifi.uio.no/~janl/
    # w3mir/math-uib/nicolai  from http://www.mi.uib.no/~nicolai/

Ignore-RE and Fetch-RE

    # Get only jpeg/jpg files, no gifs
    Fetch-RE: m/.jp(e)?g$/
    Ignore-RE: m/.gif$/

Apply

As I said earlier, Apply has not been used for Real Work
yet, that I know of. But Apply could be used to map all
web servers at the University of Oslo inside the scope of
retrieval very easily:

    # Start at the main server
    URL: http://www.uio.no/
    # Change http://*.uio.no and http://129.240.* to be a subdirectory
    # of http://www.uio.no/.
    Apply: s~^http://(.*.uio.no(?:\:\d+)?)/~http://www.uio.no/$1/~i
    Apply: s~^http://(129.240.[^:]*(?:\:\d+)?)/~http://www.uio.no/$1/~i

There are two rather extensive example files in the w3mir
distribution.
BUGS
The -lc switch does not work too well.
FEATURES
These are not bugs.
URLs with two /es ('//') in the path component do not
work as some might expect. According to my reading of the
URL spec. it is an illegal construct, which is a Good
Thing, because I don't know how to handle it if it's
legal.
If you start at http://foo/bar/ then index.html might be
gotten twice.
Some documents point to a point above the server root,
i.e., http://some.server/../stuff.html. Netscape, and
other browsers, in defiance of the URL standard documents,
will change the URL to http://some.server/stuff.html.
w3mir will not.
Authentication is only tried if the server requests it.
This might lead to a lot of extra connections going up and
down, but that's the way it's gotta work for now.
SEE ALSO
w3mfix
AUTHORS
w3mir's authors can be reached at w3mir-core@usit.uio.no.
w3mir's home page is at http://www.math.uio.no/~janl/w3mir/