ocrodjvu(1)
NAME
ocrodjvu - OCR for DjVu files
SYNOPSIS
ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}
DESCRIPTION
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on
DjVu files.
The following OCR engines are supported:
- OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command, so that ultimately Tesseract acts as the OCR backend);
- Cuneiform for Linux[2].
OPTIONS
- OCR engine options
- --engine=engine-id
Use this OCR engine. The default is 'ocropus' (OCRopus).
- --list-engines
Print list of available OCR engines.
- Options controlling output
- It is mandatory to use exactly one of the following options:
- -o, --save-bundled=output-djvu-file
Save OCR results as a bundled multi-page document into output-djvu-file.
- -i, --save-indirect=index-djvu-file
Save OCR results as an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.
- --save-script=script-file
Save a djvused script with OCR results into script-file.
- --in-place
Save OCR results in place. (Use this option to retain compatibility with ocrodjvu < 0.2.)
- --dry-run
Don't change any files; discard the OCR results.
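Because exactly one of the output options must be given, a script that drives ocrodjvu may want to check that rule before starting the program. The following Python sketch only assembles an argument list from the options documented above; the helper name build_command and its parameters are hypothetical, and it does not run ocrodjvu itself.

```python
def build_command(djvu_file, save_bundled=None, save_indirect=None,
                  save_script=None, in_place=False, dry_run=False):
    """Return an ocrodjvu argument list with exactly one output option.

    Hypothetical helper for illustration; it is not part of ocrodjvu.
    """
    # Collect every output option the caller asked for.
    chosen = []
    if save_bundled is not None:
        chosen.append("--save-bundled=" + save_bundled)
    if save_indirect is not None:
        chosen.append("--save-indirect=" + save_indirect)
    if save_script is not None:
        chosen.append("--save-script=" + save_script)
    if in_place:
        chosen.append("--in-place")
    if dry_run:
        chosen.append("--dry-run")
    # The manual requires exactly one of these options.
    if len(chosen) != 1:
        raise ValueError("exactly one output option is required")
    return ["ocrodjvu", chosen[0], djvu_file]
```

The resulting list could be passed to subprocess.call() on a system where ocrodjvu is installed.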
- Text segmentation options
- -t lines, --details=lines
Record location of every line. Don't record locations of particular words or characters. This is the default for OCRopus 0.2.
- -t words, --details=words
Record location of every line and every word. Don't record locations of particular characters. This is the default for OCRopus >= 0.3.1 and for Cuneiform. This option is ineffective with OCRopus 0.2.
- -t chars, --details=chars
Record location of every line, every word and every character. This option is ineffective with OCRopus 0.2.
- --word-segmentation=simple
Consider each non-empty sequence of non-whitespace characters a single word. This is the default, despite being linguistically incorrect.
- --word-segmentation=uax29
Use the Unicode Text Segmentation[3] algorithm to break lines into words. This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended.
- Other options
- --clear-text
Remove existing hidden text if present in the pages not selected for OCR. (Use this option to retain compatibility with ocrodjvu < 0.2.)
- --ocr-only
Don't save pages that were not processed.
- --language=language-id
Set recognition language. language-id is typically an ISO 639-2 three-letter code. The default is 'eng' (English), unless the tesslanguage environment variable is set.
- --list-languages
Print list of available languages.
- -p, --pages=page-range
Specifies pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1. The default is to process all pages.
- -j, --jobs=n
Start up to n OCR processes.
- -D, --debug
To ease debugging, don't delete intermediate files.
- --version
Output version information and exit.
- -h, --help
Display help and exit.
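The page-range syntax accepted by --pages can be illustrated with a short Python sketch. The function name parse_page_range is hypothetical, and the error handling shown here is an assumption; ocrodjvu's own parser may treat duplicates or malformed input differently.

```python
def parse_page_range(spec):
    """Expand a page-range string such as "17,37-42" into page numbers.

    Illustrative only; not taken from the ocrodjvu sources.
    """
    pages = []
    for sub_range in spec.split(","):
        if "-" in sub_range:
            # Contiguous range, e.g. "37-42".
            first, last = sub_range.split("-", 1)
            first, last = int(first), int(last)
        else:
            # Single page, e.g. "17".
            first = last = int(sub_range)
        # Pages are numbered from 1, and a range must not run backwards
        # (assumed here to be an error).
        if first < 1 or last < first:
            raise ValueError("invalid sub-range: %r" % sub_range)
        pages.extend(range(first, last + 1))
    return pages
```

For example, parse_page_range("17,37-42") returns [17, 37, 38, 39, 40, 41, 42].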
ENVIRONMENT
The following environment variable affects ocrodjvu:
- tesslanguage
Recognition language for Tesseract. (Use of this variable is deprecated in favor of the --language option.)
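The interplay between the --language option, the tesslanguage variable and the built-in default can be summarized as a lookup order. The Python sketch below is an assumption about that order based on the descriptions in this manual; the function name effective_language is hypothetical, and how ocrodjvu treats an empty tesslanguage value is not specified here.

```python
import os

def effective_language(cli_language=None, environ=None):
    """Pick the recognition language (hypothetical helper).

    Assumed order: --language value, then tesslanguage, then 'eng'.
    """
    if environ is None:
        environ = os.environ
    # An explicit --language value is assumed to take precedence.
    if cli_language is not None:
        return cli_language
    # Otherwise fall back to the deprecated variable, then to English.
    return environ.get("tesslanguage", "eng")
```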
SEE ALSO
djvu(1), ocroscript(1), tesseract(1)
AUTHOR
- Jakub Wilk <ubanus@users.sf.net>
- Author.
COPYRIGHT
Copyright (C) 2008, 2009, 2010 Jakub Wilk
NOTES
- 1. OCRopus
http://ocropus.googlecode.com/
- 2. Cuneiform for Linux
http://launchpad.net/cuneiform-linux
- 3. Unicode Text Segmentation
http://unicode.org/reports/tr29/