crossbow(1)
NAME
crossbow - a front-end with hierarchical clustering and deterministic
annealing
SYNOPSIS
crossbow [OPTION...] [ARG...]
DESCRIPTION
Crossbow is a document clustering front-end to libbow. This brief manpage
was written for the Debian GNU/Linux distribution since there is none
available in the main package.
Note that crossbow is not a supported program.
OPTIONS
- For building data structures from text files:
- --build-hier-from-dir
- When indexing a single directory, use the directory structure to build a class hierarchy
- -c, --cluster
- cluster the documents, and write the results to disk
- --classify
- Split the data into train/test, and classify the test data, outputting results in rainbow format
- --classify-files=DIRNAME
- Classify documents in DIRNAME, outputting `filename classname' pairs on each line.
- --cluster-output-dir=DIR
- After clustering is finished, write the clusters to directory DIR
- -i, --index
- tokenize training documents found under ARG..., build weight vectors, and save them to disk
- --index-multiclass-list=FILE
- Index the files listed in FILE. Each line of FILE should contain a filename followed by a list of classnames to which that file belongs.
- --print-doc-names[=TAG]
- Print the filenames of documents contained in the model. If the optional TAG argument is given, print only the documents that have the specified tag.
- --print-matrix
- Print the word/document count matrix in an awk- or perl-accessible format. Format is sparse and includes the words and the counts.
- --print-word-probabilities=FILEPREFIX
- Print the word probability distribution in each leaf to files named FILEPREFIX-classname
- --query-server=PORTNUM
- Run crossbow in server mode, listening on socket number PORTNUM. You can try it by executing this command, then in a different shell window on the same machine typing `telnet localhost PORTNUM'.
- --use-vocab-in-file=FILENAME
- Limit vocabulary to just those words read as space-separated strings from FILENAME.
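A typical build-then-cluster session can be sketched as follows; crossbow must be installed, and the corpus, model, and output directories shown are hypothetical stand-ins, not defaults:

```shell
# Index a directory tree, letting its layout define the class hierarchy,
# then cluster the indexed documents and write the clusters out.
# All paths here are illustrative.
crossbow -d ~/model --build-hier-from-dir -i ~/corpus
crossbow -d ~/model -c --cluster-output-dir=~/clusters
```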
- Splitting options:
- --ignore-set=SOURCE
- How to select the ignored documents. Same format as --test-set. Default is `0'.
- --set-files-use-basename[=N]
- When using files to specify doc types, compare only the last N components of the doc's pathname; that is, use the filename and the last N-1 directory names. If N is not specified, it defaults to 1.
- --test-set=SOURCE
- How to select the testing documents. A number between 0 and 1 inclusive with a decimal point indicates a random fraction of all documents. The number of documents selected from each class is determined by attempting to match the proportions of the non-ignored documents. A number with no decimal point indicates the number of documents to select randomly. Alternatively, a suffix of `pc' indicates the number of documents per class to tag. The suffix `t' for a number or proportion indicates to tag documents from the pool of training documents, not the untagged documents. `remaining' selects all documents that remain untagged at the end. Anything else is interpreted as a filename listing documents to select. Default is `0.0'.
- --train-set=SOURCE
- How to select the training documents. Same format as --test-set. Default is `remaining'.
- --unlabeled-set=SOURCE
- How to select the unlabeled documents. Same format as --test-set. Default is `0'.
- --validation-set=SOURCE
- How to select the validation documents. Same format as --test-set. Default is `0'.
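The splitting options combine with --classify; a minimal sketch, assuming an already-indexed model directory (the path and the 20% fraction are illustrative):

```shell
# Hold out a random 20% of documents for testing, train on the rest,
# and print classifications in rainbow format.
crossbow -d ~/model --test-set=0.2 --train-set=remaining --classify
```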
- Hierarchical EM Clustering options:
- --hem-branching-factor=NUM
- Number of clusters to create. Default is 2.
- --hem-deterministic-horizontal
- In the horizontal E-step for a document, set to zero the membership probabilities of all leaves, except the one matching the document's filename
- --hem-garbage-collection
- Add extra /Misc/ children to every internal node of the hierarchy, and keep their local word distributions flat
- --hem-incremental-labeling
- Instead of using all unlabeled documents in the M-step, use only the labeled documents, and incrementally label those unlabeled documents that are most confidently classified in the E-step
- --hem-lambdas-from-validation=NUM
- Instead of setting the lambdas from the labeled/unlabeled data (possibly with LOO), set the lambdas using held-out validation data. 0<NUM<1 is the fraction of unlabeled documents to hold out just before EM training of the classifier begins. Default is 0, which leaves this option off.
- --hem-max-num-iterations=NUM
- Do no more iterations of EM than this.
- --hem-maximum-depth=NUM
- The hierarchy depth beyond which it will not split. Default is 6.
- --hem-no-loo
- Do not use leave-one-out evaluation during the E-step.
- --hem-no-shrinkage
- Use only the clusters at the leaves; do not do anything with the hierarchy.
- --hem-no-vertical-word-movement
- Use EM just to set the vertical priors, not to set the vertical word distribution; i.e., do not do `full-EM'.
- --hem-pseudo-labeled
- After using the labels to set the starting point for EM, change all training documents to unlabeled, so that they can have their class labels re-assigned by EM. Useful for imperfectly labeled training data.
- --hem-restricted-horizontal
- In the horizontal E-step for a document, set to zero the membership probabilities of all leaves whose names are not found in the document's filename
- --hem-split-kl-threshold=NUM
- KL divergence value at which tree leaves will be split. Default is 0.2.
- --hem-temperature-decay=NUM
- Temperature decay factor. Default is 0.9.
- --hem-temperature-end=NUM
- The final value of T. Default is 1.
- --hem-temperature-start=NUM
- The initial value of T.
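The annealing options above can be combined with the hem-cluster method; a hedged sketch in which every numeric value is illustrative rather than a recommended setting:

```shell
# Deterministic-annealing sketch: start at a high temperature and decay
# toward 1 while clustering with the hierarchical EM method.
crossbow -d ~/model -m hem-cluster \
    --hem-branching-factor=2 --hem-maximum-depth=4 \
    --hem-temperature-start=10 --hem-temperature-end=1 \
    --hem-temperature-decay=0.9 -c
```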
- General options
- --annotations=FILE
- The sarray file containing annotations for the files in the index
- -b, --no-backspaces
- Don't use backspace when verbosifying progress (good for use in emacs)
- -d, --data-dir=DIR
- Set the directory in which to read/write word-vector data (default=~/.<program_name>).
- --random-seed=NUM
- The non-negative integer to use for seeding the random number generator
- --score-precision=NUM
- The number of decimal digits to print when displaying document scores
- -v, --verbosity=LEVEL
- Set amount of info printed while running (0=silent, 1=quiet, 2=show-progress, ... 5=max)
- Lexing options
- --append-stoplist-file=FILE
- Add words in FILE to the stoplist.
- --exclude-filename=FILENAME
- When scanning directories for text files, skip files with name matching FILENAME.
- -g, --gram-size=N
- Create tokens for all 1-grams,... N-grams.
- -h, --skip-header
- Avoid lexing news/mail headers by scanning forward until two newlines.
- --istext-avoid-uuencode
- Check for uuencoded blocks before saying that the file is text, and say no if there are many lines of the same length.
- --lex-pipe-command=SHELLCMD
- Pipe files through this shell command before lexing them.
- --max-num-words-per-document=N
- Only tokenize the first N words in each document.
- --no-stemming
- Do not modify lexed words with a stemming function. (usually the default, depending on lexer)
- --replace-stoplist-file=FILE
- Empty the default stoplist, and add space-delimited words from FILE.
- -s, --no-stoplist
- Do not toss lexed words that appear in the stoplist.
- --shortest-word=LENGTH
- Toss lexed words that are shorter than LENGTH. Default is usually 2.
- -S, --use-stemming
- Modify lexed words with the `Porter' stemming function.
- --use-stoplist
- Toss lexed words that appear in the stoplist. (usually the default SMART stoplist, depending on lexer)
- --use-unknown-word
- When used in conjunction with -O or -D, captures all words with occurrence counts below threshold as the `<unknown>' token
- --xxx-words-only
- Only tokenize words with `xxx' in them
- Mutually exclusive choice of lexers
- --flex-mail
- Use a mail-specific flex lexer
- --flex-tagged
- Use a tagged flex lexer
- -H, --skip-html
- Skip HTML tokens when lexing.
- --lex-alphanum
- Use a special lexer that includes digits in tokens, delimiting tokens only by non-alphanumeric characters.
- --lex-infix-string=ARG
- Use only the characters after ARG in each word for stoplisting and stemming. If a word does not contain ARG, the entire word is used.
- --lex-suffixing
- Use a special lexer that adds suffixes depending on Email-style headers.
- --lex-white
- Use a special lexer that delimits tokens by whitespace only, and does not change the contents of the token at all: no downcasing, no stemming, no stoplisting, nothing. Ideal for use with an externally-written lexer interfaced to rainbow with --lex-pipe-command.
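A sketch of pairing --lex-white with an external lexer: --lex-pipe-command rewrites each file before lexing, and --lex-white then takes the piped tokens verbatim. The tr command below is only a stand-in for a real external tokenizer, and the paths are hypothetical:

```shell
# Downcase every file through a pipe, then index the unmodified tokens.
crossbow -d ~/model --lex-white \
    --lex-pipe-command="tr '[:upper:]' '[:lower:]'" -i ~/corpus
```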
- Feature-selection options
- -D, --prune-vocab-by-doc-count=N
- Remove words that occur in N or fewer documents.
- -O, --prune-vocab-by-occur-count=N
- Remove words that occur fewer than N times.
- -T, --prune-vocab-by-infogain=N
- Remove all but the top N words by selecting words with highest information gain.
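Feature selection happens at indexing time; a sketch with illustrative thresholds and a hypothetical corpus path:

```shell
# Drop words that appear in 2 or fewer documents, then keep only the
# 1000 words with the highest information gain.
crossbow -d ~/model -D 2 -T 1000 -i ~/corpus
```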
- Weight-vector setting/scoring method options
- --binary-word-counts
- Instead of using integer occurrence counts of words to set weights, use binary absence/presence.
- --event-document-then-word-document-length=NUM
- Set the normalized length of documents when --event-model=document-then-word
- --event-model=EVENTNAME
- Set what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word, document, document-then-word. Default is `word'.
- --infogain-event-model=EVENTNAME
- Set what objects will be considered the `events' when information gain is calculated. EVENTNAME can be one of: word, document, document-then-word. Default is `document'.
- -m, --method=METHOD
- Set the word weight-setting method; METHOD may be one of: fienberg-classify, hem-classify, hem-cluster, multiclass, naivebayes. Default is `naivebayes'.
- --print-word-scores
- During scoring, print the contribution of each word to each class.
- --smoothing-dirichlet-filename=FILE
- The file containing the alphas for the dirichlet smoothing.
- --smoothing-dirichlet-weight=NUM
- The weighting factor by which to multiply the alphas for dirichlet smoothing.
- --smoothing-goodturing-k=NUM
- Smooth word probabilities for words that occur NUM or fewer times. The default is 7.
- --smoothing-method=METHOD
- Set the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell.
- --uniform-class-priors
- When setting weights, calculating infogain and scoring, use equal prior probabilities on classes.
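The scoring options above can be combined; a hedged sketch, with option choices shown only as an illustration of the syntax:

```shell
# Classify held-out documents with naive Bayes weights, Laplace
# smoothing, and uniform class priors.
crossbow -d ~/model -m naivebayes --smoothing-method=laplace \
    --uniform-class-priors --classify
```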
- -?, --help
- Give this help list
- --usage
- Give a short usage message
- -V, --version
- Print program version
- Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
REPORTING BUGS
Please report bugs related to this program to Andrew McCallum <mccallum@cs.cmu.edu>. If the bug is related to the Debian package, send it to
submit@bugs.debian.org.
SEE ALSO
- arrow(1), archer(1), rainbow(1).
- The full documentation for crossbow will be provided as a Texinfo manual. If the info and crossbow programs are properly installed at your site, the command
- info crossbow
- should give you access to the complete manual.
- You can also find documentation and updates for libbow at http://www.cs.cmu.edu/~mccallum/bow