archer(1)
NAME
archer - AltaVista-style document retrieval front-end using libbow
SYNOPSIS
archer [OPTION...] [ARG...]
DESCRIPTION
Archer is a standalone program that performs document retrieval with
AltaVista-style queries, using +, -, "", etc. The commands shown in the
examples of the arrow(1) manpage also work for archer.
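
The query operators can be illustrated with a short Python sketch. It assumes the usual AltaVista convention (+ marks a required term, - an excluded term, and "" a phrase); this manpage does not spell out how archer itself interprets them, and parse_query is a hypothetical helper, not part of archer or libbow.

```python
import re

# One query term: optional +/- sign, then either a "quoted phrase" or a bare word.
TERM = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query string into required, excluded, and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TERM.findall(query):
        term = phrase if phrase else word
        if not term:
            continue  # skip empty quoted phrases such as ""
        # Assumption: + means "must appear", - means "must not appear".
        key = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[key].append(term)
    return parsed

result = parse_query('+retrieval -spam "document index" libbow')
print(result)
```

In this sketch a quoted phrase is kept as a single term, so a caller could match it against adjacent words rather than against each word separately.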
OPTIONS
- For building data structures from text files:
- -i, --index=DIRNAME
- Tokenize training documents found under DIRNAME, and save them to disk
- --index-lines=FILENAME
- Like --index, except index each line of FILENAME as if it were a separate document. Documents are named after sequential line numbers.
- For doing document retrieval using the data structures built with -i:
- -n, --num-hits-to-show=N
- Show the N documents that are most similar to the query text (default N=1)
- -q, --query=WORDS
- Tokenize input from stdin [or FILE], then print the document most like it
- --query-forking-server=PORTNUM
- Run archer in socket server mode, forking a new process with every connection. Allows multiple simultaneous connections.
- --query-server=PORTNUM
- Run archer in socket server mode.
- --score-is-raw-count
- Instead of using a weighted sum of logs, the score of a document will be simply the number of terms in both the query and the document.
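
The raw-count score described for --score-is-raw-count can be sketched in Python as follows. This is an illustration only, not archer's implementation: raw_count_score and the sample documents are hypothetical, and the sketch assumes each distinct shared term counts once.

```python
def raw_count_score(query_terms, document_terms):
    """Number of distinct terms that occur in both the query and the document."""
    return len(set(query_terms) & set(document_terms))

# Hypothetical, already-tokenized documents.
docs = {
    "a.txt": "the quick brown fox".split(),
    "b.txt": "the lazy dog sleeps".split(),
}
query = "quick brown dog".split()

scores = {name: raw_count_score(query, terms) for name, terms in docs.items()}
# Highest score first, analogous to showing the best N hits.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```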
- Diagnostics
- -p, --print-all
- Print, in unsorted order, all the document indices, positions and words
- -s, --print-word-stats
- Print the number of times each word occurs.
- General options
- --annotations=FILE
- The sarray file containing annotations for the files in the index
- -b, --no-backspaces
- Don't use backspace when verbosifying progress (good for use in emacs)
- -d, --data-dir=DIR
- Set the directory in which to read/write word-vector data (default=~/.<program_name>).
- --random-seed=NUM
- The non-negative integer to use for seeding the random number generator
- --score-precision=NUM
- The number of decimal digits to print when displaying document scores
- -v, --verbosity=LEVEL
- Set amount of info printed while running (0=silent, 1=quiet, 2=show-progress, ... 5=max)
- Lexing options
- --append-stoplist-file=FILE
- Add words in FILE to the stoplist.
- --exclude-filename=FILENAME
- When scanning directories for text files, skip files with name matching FILENAME.
- -g, --gram-size=N
- Create tokens for all 1-grams,... N-grams.
- -h, --skip-header
- Avoid lexing news/mail headers by scanning forward until two newlines.
- --istext-avoid-uuencode
- Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length.
- --lex-pipe-command=SHELLCMD
- Pipe files through this shell command before lexing them.
- --max-num-words-per-document=N
- Only tokenize the first N words in each document.
- --no-stemming
- Do not modify lexed words with a stemming function. (usually the default, depending on lexer)
- --no-stoplist
- Do not toss lexed words that appear in the stoplist.
- --replace-stoplist-file=FILE
- Empty the default stoplist, and add space-delimited words from FILE.
- --shortest-word=LENGTH
- Toss lexed words that are shorter than LENGTH. Default is usually 2.
- -S, --use-stemming
- Modify lexed words with the `Porter' stemming function.
- --use-stoplist
- Toss lexed words that appear in the stoplist. (usually the default SMART stoplist, depending on lexer)
- --use-unknown-word
- When used in conjunction with -O or -D, captures all words with occurrence counts below threshold as the `<unknown>' token
- --xxx-words-only
- Only tokenize words with `xxx' in them
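
As an illustration of what --gram-size=N asks for, the Python sketch below builds all 1-grams through N-grams from a word list. The function name and the space used to join the words of an n-gram are assumptions made for this example; libbow's internal token format may differ.

```python
def ngram_tokens(words, max_n):
    """Return all 1-grams, 2-grams, ... max_n-grams of a word list."""
    tokens = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words across the list.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

tokens = ngram_tokens(["text", "retrieval", "engine"], 2)
print(tokens)
```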
- Mutually exclusive choice of lexers
- --flex-mail
- Use a mail-specific flex lexer
- --flex-tagged
- Use a tagged flex lexer
- -H, --skip-html
- Skip HTML tokens when lexing.
- --lex-alphanum
- Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters.
- --lex-infix-string=ARG
- Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
- --lex-suffixing
- Use a special lexer that adds suffixes depending on Email-style headers.
- --lex-white
- Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-command.
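
The difference between --lex-white and --lex-alphanum can be sketched in Python. Both functions are hypothetical stand-ins that model only how tokens are delimited; any case folding, stoplisting, or stemming that the real alphanumeric lexer may apply is not modeled.

```python
import re

def lex_white(text):
    """Delimit tokens by whitespace only; token contents are left untouched."""
    return text.split()

def lex_alphanum(text):
    """Delimit tokens by any non-alphanumeric character; digits stay in tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

sample = "Call 555-1234, e-mail Bob99!"
white_tokens = lex_white(sample)
alphanum_tokens = lex_alphanum(sample)
print(white_tokens)
print(alphanum_tokens)
```

Note how punctuation survives inside the whitespace-delimited tokens, while the alphanumeric variant splits on it.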
- Feature-selection options
- -D, --prune-vocab-by-doc-count=N
- Remove words that occur in N or fewer documents.
- -O, --prune-vocab-by-occur-count=N
- Remove words that occur fewer than N times.
- -T, --prune-vocab-by-infogain=N
- Remove all but the top N words by selecting words with highest information gain.
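
The two count-based pruning rules differ at the boundary: -D removes words found in N or fewer documents, while -O removes words occurring fewer than N times. The Python sketch below shows that difference on hypothetical data; the helper names are invented for this example and are not libbow functions.

```python
from collections import Counter

def keep_by_doc_count(docs, n):
    """Keep words that occur in more than n documents (mirrors -D)."""
    doc_counts = Counter(word for doc in docs for word in set(doc))
    return {word for word, count in doc_counts.items() if count > n}

def keep_by_occur_count(docs, n):
    """Keep words that occur at least n times in total (mirrors -O)."""
    occur_counts = Counter(word for doc in docs for word in doc)
    return {word for word, count in occur_counts.items() if count >= n}

# Hypothetical tokenized documents.
docs = [["apple", "apple", "pear"], ["apple", "fig"], ["pear", "kiwi"]]

kept_by_docs = keep_by_doc_count(docs, 1)
kept_by_occurrences = keep_by_occur_count(docs, 3)
print(kept_by_docs, kept_by_occurrences)
```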
- Weight-vector setting/scoring method options
- --binary-word-counts
- Instead of using integer occurrence counts of words to set weights, use binary absence/presence.
- --event-document-then-word-document-length=NUM
- Set the normalized length of documents when --event-model=document-then-word
- --event-model=EVENTNAME
- Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word.
- Default is `word'.
- --infogain-event-model=EVENTNAME
- Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word.
- Default is `document'.
- -m, --method=METHOD
- Set the word weight-setting method; METHOD may be one of:
- --print-word-scores
- During scoring, print the contribution of each word to each class.
- --smoothing-dirichlet-filename=FILE
- The file containing the alphas for the dirichlet smoothing.
- --smoothing-dirichlet-weight=NUM
- The weighting factor by which to multiply the alphas for dirichlet smoothing.
- --smoothing-goodturing-k=NUM
- Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
- --smoothing-method=METHOD
- Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell
- --uniform-class-priors
- When setting weights, calculating infogain and scoring, use equal prior probabilities on classes.
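
To show why smoothing avoids zero probabilities, here is a minimal Python sketch of textbook Laplace (add-one) smoothing. It is a generic formulation, not a transcription of libbow's code; the function name, counts, and vocabulary are hypothetical.

```python
from collections import Counter

def laplace_probabilities(word_counts, vocabulary):
    """Add-one smoothing: P(w) = (count(w) + 1) / (total + |V|)."""
    total = sum(word_counts.get(word, 0) for word in vocabulary)
    denominator = total + len(vocabulary)
    return {word: (word_counts.get(word, 0) + 1) / denominator
            for word in vocabulary}

counts = Counter({"index": 3, "query": 1})
vocabulary = ["index", "query", "server"]  # "server" was never observed

probs = laplace_probabilities(counts, vocabulary)
print(probs)
```

The unseen word still receives a small non-zero probability, and the probabilities over the vocabulary sum to one.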
- -?, --help
- Give this help list
- --usage
- Give a short usage message
- -V, --version
- Print program version
- Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
REPORTING BUGS
Please report bugs related to this program to Andrew McCallum <mccallum@cs.cmu.edu>. If the bugs are related to the Debian package, send
them to submit@bugs.debian.org
SEE ALSO
rainbow(1), arrow(1), crossbow(1).
- The full documentation for arrow will be provided as a Texinfo manual.
If the info and arrow programs are properly installed at your site, the
command
- info arrow
- should give you access to the complete manual.
- You can also find documentation and updates for libbow at http://www.cs.cmu.edu/~mccallum/bow