ocrodjvu(1)

NAME

ocrodjvu - OCR for DjVu files

SYNOPSIS

ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}

DESCRIPTION

ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.

The following OCR engines are supported:

o OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or: rec-tess) command, so that ultimately Tesseract acts as the OCR backend);
o Cuneiform for Linux[2].

OPTIONS

OCR engine options:

--engine=engine-id
    Use this OCR engine. The default is 'ocropus' (OCRopus).

--list-engines
    Print list of available OCR engines.

Options controlling output:

It is mandatory to use exactly one of the following options:

-o, --save-bundled=output-djvu-file
    Save OCR results as a bundled multi-page document into output-djvu-file.

-i, --save-indirect=index-djvu-file
    Save OCR results as an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.

--save-script=script-file
    Save a djvused script with OCR results into script-file.

--in-place
    Save OCR results in place.

    (Use this option to retain compatibility with ocrodjvu < 0.2.)

--dry-run
    Don't change any files, throw OCR results away.
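
The rule that exactly one output option must be given can be checked before ocrodjvu is started. The following is a minimal Python sketch, not part of ocrodjvu: the helper name build_command and its keyword arguments are hypothetical, and only option names listed above are used.

```python
def build_command(djvu_file, *, save_bundled=None, save_indirect=None,
                  save_script=None, in_place=False, dry_run=False):
    """Return an argv list for ocrodjvu with exactly one output option.

    Hypothetical helper: it only assembles the command line and does
    not run ocrodjvu.
    """
    chosen = []
    if save_bundled is not None:
        chosen.append(["--save-bundled", save_bundled])
    if save_indirect is not None:
        chosen.append(["--save-indirect", save_indirect])
    if save_script is not None:
        chosen.append(["--save-script", save_script])
    if in_place:
        chosen.append(["--in-place"])
    if dry_run:
        chosen.append(["--dry-run"])
    # The manual requires exactly one of the output options.
    if len(chosen) != 1:
        raise ValueError(
            "exactly one output option is required, got %d" % len(chosen)
        )
    return ["ocrodjvu"] + chosen[0] + [djvu_file]


if __name__ == "__main__":
    # The file names here are placeholders.
    print(build_command("book.djvu", save_bundled="book-ocr.djvu"))
```

The resulting list could be passed to subprocess.run(); rejecting a missing or duplicated output option up front avoids starting a process that would fail on its arguments.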

Text segmentation options:

-t lines, --details=lines
    Record location of every line. Don't record locations of particular words or characters.

    This is the default for OCRopus 0.2.

-t words, --details=words
    Record location of every line and every word. Don't record locations of particular characters.

    This is the default for OCRopus >= 0.3.1 and for Cuneiform.

    This option is ineffective with OCRopus 0.2.

-t chars, --details=chars
    Record location of every line, every word and every character.

    This option is ineffective with OCRopus 0.2.

--word-segmentation=simple
    Consider each non-empty sequence of non-whitespace characters a single word.

    This is the default, despite being linguistically incorrect.

--word-segmentation=uax29
    Use the Unicode Text Segmentation[3] algorithm to break lines into words.

    This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended.
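
The "simple" rule can be illustrated in a few lines of Python. This is only a sketch of the rule as worded above, not ocrodjvu's own code; the function name simple_words is hypothetical.

```python
def simple_words(line):
    """Split a line into maximal runs of non-whitespace characters."""
    # str.split() with no argument splits on runs of whitespace and
    # never returns empty strings, which matches the "simple" rule.
    return line.split()


if __name__ == "__main__":
    print(simple_words("Hello, world!  It's a foo-bar test."))
```

Punctuation stays attached to the neighbouring word ("Hello," and "world!"), which is why the rule is described as linguistically incorrect; in exchange, words remain separated by spaces, as some DjVu tools assume.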

Other options:

--clear-text
    Remove existing hidden text if present in the pages not selected for OCR.

    (Use this option to retain compatibility with ocrodjvu < 0.2.)

--ocr-only
    Don't save pages that were not processed.

--language=language-id
    Set recognition language. language-id is typically an ISO 639-2 three-letter code.

    The default is 'eng' (English), unless the tesslanguage environment variable is set.

--list-languages
    Print list of available languages.

-p, --pages=page-range
    Specifies pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.

    The default is to process all pages.

-j, --jobs=n
    Start up to n OCR processes.

-D, --debug
    To ease debugging, don't delete intermediate files.

--version
    Output version information and exit.

-h, --help
    Display help and exit.
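
The page-range syntax accepted by --pages can be illustrated with a short Python sketch. It follows only the description above (comma-separated sub-ranges, each a single page or first-last, numbered from 1). The function name parse_page_range is hypothetical, and how ocrodjvu itself treats malformed, reversed, or overlapping sub-ranges is not stated here, so raising ValueError for the first two and keeping duplicates are assumptions.

```python
def parse_page_range(page_range):
    """Expand a page-range string such as "17,37-42" into page numbers."""
    pages = []
    for sub_range in page_range.split(","):
        # A sub-range is either "N" or "N-M".
        first, sep, last = sub_range.strip().partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1; reversed ranges are rejected
        # (an assumption of this sketch).
        if start < 1 or end < start:
            raise ValueError("invalid sub-range: %r" % sub_range)
        pages.extend(range(start, end + 1))
    return pages


if __name__ == "__main__":
    print(parse_page_range("17,37-42"))
```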

ENVIRONMENT

The following environment variable affects ocrodjvu:

tesslanguage
    Recognition language for Tesseract. (Use of this variable is deprecated in favor of the --language option.)

AUTHOR

Jakub Wilk <ubanus@users.sf.net>: Author.

COPYRIGHT

NOTES

1. OCRopus
   http://ocropus.googlecode.com/

2. Cuneiform for Linux
   http://launchpad.net/cuneiform-linux

3. Unicode Text Segmentation
   http://unicode.org/reports/tr29/
