arrow(1)
NAME
arrow - document retrieval front-end to libbow
SYNOPSIS
arrow [OPTION...] [ARG...]
DESCRIPTION
Arrow is a document retrieval front-end to libbow. It uses TFIDF to
retrieve relevant documents.
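To make the idea of TFIDF retrieval concrete, here is a minimal Python sketch of ranking documents by TFIDF cosine similarity. It is not libbow's implementation: the whitespace tokenizer, the tf * log(N/df) weighting, and the function names tfidf_index, cosine, and rank are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """Return ({word: weight} per document, idf table), with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    vectors = [
        {word: tf * idf[word] for word, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document index) pairs, most similar document first."""
    vectors, idf = tfidf_index(docs)
    # Query words outside the indexed vocabulary are ignored (an assumption).
    counts = Counter(w for w in query.lower().split() if w in idf)
    q_vec = {word: tf * idf[word] for word, tf in counts.items()}
    return sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
```

Calling rank("cat mat", docs) on a small list of strings returns the documents ordered by score, which is roughly what the -n option controls when it limits how many hits are shown.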
EXAMPLES
If you have a collection of documents under foo, type arrow --index foo to build the index. You can then make a query by typing arrow --query, entering your query text, and pressing Control-D.
If you want to make many queries, it is more efficient to run arrow as a server and query it repeatedly, without restarts, through a socket. For example, type arrow --query-server=9876 and connect to port 9876, for instance with telnet localhost 9876. In this mode there is no need to press Control-D to end a query. Simply type your query on one line and press Return.
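As an alternative to telnet, a script can talk to the server directly. The following Python sketch sends one query line and collects whatever the server writes back. The function name query_arrow is hypothetical, and because this page does not describe the reply format, the sketch assumes a reply is complete when the server closes the connection or stays silent for the timeout period.

```python
import socket


def query_arrow(query, host="localhost", port=9876, timeout=2.0):
    """Send one query line to a running arrow server and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # Server mode expects the whole query on a single line.
        conn.sendall(query.strip().encode("utf-8") + b"\n")
        chunks = []
        try:
            while True:
                data = conn.recv(4096)
                if not data:  # the server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # assumption: a quiet period means the reply is finished
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a server started as arrow --query-server=9876, calling query_arrow("text retrieval") would return the server's output as a string for the caller to parse.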
OPTIONS
- General options
- For building data structures from text files:
- -i, --index
- tokenize training documents found under ARG..., build weight vectors, and save them to disk
- For doing document retrieval using the data structures built with -i:
- -c, --compare=FILE
- Print the TFIDF cosine similarity metric of the query with this FILE.
- -n, --num-hits-to-show=N
- Show the N documents that are most similar to the query text (default N=1)
- -q, --query[=FILE]
- tokenize input from stdin [or FILE], then print document most like it
- --query-forking-server=PORTNUM
- Run arrow in socket server mode, forking a new process with every connection. Allows multiple simultaneous connections.
- --query-server=PORTNUM
- Run arrow in socket server mode.
- Diagnostics
- --print-coo
- Print word co-occurrence statistics.
- --print-idf
- Print, in unsorted order, the IDF of all words in the model's vocabulary
- --annotations=FILE
- The sarray file containing annotations for the files in the index
- -b, --no-backspaces
- Don't use backspace when verbosifying progress (good for use in emacs)
- -d, --data-dir=DIR
- Set the directory in which to read/write word-vector data (default=~/.<program_name>).
- --random-seed=NUM
- The non-negative integer to use for seeding the random number generator
- --score-precision=NUM
- The number of decimal digits to print when displaying document scores
- -v, --verbosity=LEVEL
- Set amount of info printed while running (0=silent, 1=quiet, 2=show-progress, ... 5=max)
- Lexing options
- --append-stoplist-file=FILE
- Add words in FILE to the stoplist.
- --exclude-filename=FILENAME
- When scanning directories for text files, skip files with name matching FILENAME.
- -g, --gram-size=N
- Create tokens for all 1-grams, ..., N-grams.
- -h, --skip-header
- Avoid lexing news/mail headers by scanning forward until two newlines.
- --istext-avoid-uuencode
- Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length.
- --lex-pipe-command=SHELLCMD
- Pipe files through this shell command before lexing them.
- --max-num-words-per-document=N
- Only tokenize the first N words in each document.
- --no-stemming
- Do not modify lexed words with a stemming function. (usually the default, depending on lexer)
- --replace-stoplist-file=FILE
- Empty the default stoplist, and add space-delimited words from FILE.
- -s, --no-stoplist
- Do not toss lexed words that appear in the stoplist.
- --shortest-word=LENGTH
- Toss lexed words that are shorter than LENGTH. Default is usually 2.
- -S, --use-stemming
- Modify lexed words with the `Porter' stemming function.
- --use-stoplist
- Toss lexed words that appear in the stoplist. (usually the default SMART stoplist, depending on lexer)
- --use-unknown-word
- When used in conjunction with -O or -D, captures all words with occurrence counts below threshold as the `<unknown>' token
- --xxx-words-only
- Only tokenize words with `xxx' in them
- Mutually exclusive choice of lexers
- --flex-mail
- Use a mail-specific flex lexer
- --flex-tagged
- Use a tagged flex lexer
- -H, --skip-html
- Skip HTML tokens when lexing.
- --lex-alphanum
- Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters.
- --lex-infix-string=ARG
- Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
- --lex-suffixing
- Use a special lexer that adds suffixes depending on Email-style headers.
- --lex-white
- Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplist, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-command.
- Feature-selection options
- -D, --prune-vocab-by-doc-count=N
- Remove words that occur in N or fewer documents.
- -O, --prune-vocab-by-occur-count=N
- Remove words that occur fewer than N times.
- -T, --prune-vocab-by-infogain=N
- Remove all but the top N words by selecting words with highest information gain.
- Weight-vector setting/scoring method options
- --binary-word-counts
- Instead of using integer occurrence counts of words to set weights, use binary absence/presence.
- --event-document-then-word-document-length=NUM
- Set the normalized length of documents when --event-model=document-then-word
- --event-model=EVENTNAME
- Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word. Default is `word'.
- --infogain-event-model=EVENTNAME
- Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word. Default is `document'.
- -m, --method=METHOD
- Set the word weight-setting method; METHOD may be one of: tfidf_words, tfidf_log_words, tfidf_log_occur, tfidf, default=naivebayes.
- --print-word-scores
- During scoring, print the contribution of each word to each class.
- --smoothing-dirichlet-filename=FILE
- The file containing the alphas for the dirichlet smoothing.
- --smoothing-dirichlet-weight=NUM
- The weighting factor by which to multiply the alphas for dirichlet smoothing.
- --smoothing-goodturing-k=NUM
- Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
- --smoothing-method=METHOD
- Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell
- --uniform-class-priors
- When setting weights, calculating infogain and scoring, use equal prior probabilities on classes.
- -?, --help
- Give this help list
- --usage
- Give a short usage message
- -V, --version
- Print program version
- Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
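The lexing and feature-selection options above can be pictured with a small Python sketch. It mimics the documented effect of --shortest-word, the stoplist, --gram-size, -D, and -O on a toy corpus. The function names lex and prune_vocab are hypothetical, and details such as forming N-grams after stoplist removal and joining them with a space are assumptions, not a description of how libbow actually does it.

```python
from collections import Counter


def lex(text, stoplist=frozenset(), shortest_word=2, gram_size=1):
    """Downcase, drop short and stoplisted words, then emit 1-grams up to gram_size-grams."""
    words = [
        w for w in text.lower().split()
        if len(w) >= shortest_word and w not in stoplist
    ]
    tokens = []
    for n in range(1, gram_size + 1):
        for i in range(len(words) - n + 1):
            # Assumption: an N-gram is represented as its words joined by a space.
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def prune_vocab(token_lists, doc_count=0, occur_count=0):
    """Return the vocabulary that survives -D (doc_count) and -O (occur_count) pruning."""
    occurrences = Counter(t for tokens in token_lists for t in tokens)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return {
        t for t in occurrences
        # -D removes words in doc_count or fewer documents;
        # -O removes words that occur fewer than occur_count times.
        if doc_freq[t] > doc_count and occurrences[t] >= occur_count
    }
```

For example, lex("The quick brown fox", stoplist={"the"}, gram_size=2) yields the three remaining words followed by the two bigrams built from them.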
REPORTING BUGS
Please report bugs related to this program to Andrew McCallum <mccallum@cs.cmu.edu>. If the bugs are related to the Debian package, send
them to submit@bugs.debian.org.
SEE ALSO
archer(1), crossbow(1), rainbow(1).
The full documentation for arrow will be provided as a Texinfo manual.
If the info and arrow programs are properly installed at your site, the
command
    info arrow
should give you access to the complete manual.
You can also find documentation and updates for libbow at http://www.cs.cmu.edu/~mccallum/bow