ocrodjvu(1)

NAME

ocrodjvu - OCR for DjVu files

SYNOPSIS

ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file
ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file
ocrodjvu --save-script script-file [option...] djvu-file
ocrodjvu --in-place [option...] djvu-file
ocrodjvu --dry-run [option...] djvu-file
ocrodjvu {--version | --help | -h | --list-engines | --list-languages}

DESCRIPTION

ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.

The following OCR engines are supported:

o OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or
rec-tess) command, so that ultimately Tesseract acts as the OCR backend);
o Cuneiform for Linux[2].

OPTIONS

OCR engine options
--engine=engine-id
Use this OCR engine. The default is 'ocropus' (OCRopus).
--list-engines
Print list of available OCR engines.
Options controlling output
It is mandatory to use exactly one of the following options:
-o, --save-bundled=output-djvu-file
Save OCR results as a bundled multi-page document into
output-djvu-file.
-i, --save-indirect=index-djvu-file
Save OCR results as an indirect multi-page document. Use
index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.
--save-script=script-file
Save a djvused script with OCR results into script-file.
--in-place
Save OCR results in place.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--dry-run
Don't change any files; discard the OCR results.
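The "exactly one output option" rule above can be modelled as a required, mutually exclusive option group. The following is only an illustrative sketch using Python's standard argparse module; it is not claimed to be how ocrodjvu itself parses its command line, and the names build_parser and ocrodjvu-sketch are hypothetical.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the output options listed above."""
    parser = argparse.ArgumentParser(prog="ocrodjvu-sketch")
    # Exactly one of these must be given: required + mutually exclusive.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-o", "--save-bundled", metavar="output-djvu-file")
    group.add_argument("-i", "--save-indirect", metavar="index-djvu-file")
    group.add_argument("--save-script", metavar="script-file")
    group.add_argument("--in-place", action="store_true")
    group.add_argument("--dry-run", action="store_true")
    parser.add_argument("djvu_file", metavar="djvu-file")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args(["--dry-run", "input.djvu"])
    print(args.dry_run, args.djvu_file)
```

With this sketch, passing no output option, or two of them at once, makes argparse report an error and exit.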
Text segmentation options
-t lines, --details=lines
Record location of every line. Don't record locations of particular words or characters.
This is the default for OCRopus 0.2.
-t words, --details=words
Record location of every line and every word. Don't record
locations of particular characters.
This is the default for OCRopus >= 0.3.1 and for Cuneiform.
This option is ineffective with OCRopus 0.2.
-t chars, --details=chars
Record location of every line, every word and every character.
This option is ineffective with OCRopus 0.2.
--word-segmentation=simple
Consider each non-empty sequence of non-whitespace characters a
single word.
This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
Use the Unicode Text Segmentation[3] algorithm to break lines into words.
This option breaks the assumption made by some DjVu tools that words are
separated by spaces, and it is therefore not recommended.
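To make the "simple" rule concrete, here is a minimal sketch of whitespace-based word segmentation. It only illustrates the rule as described above (each non-empty run of non-whitespace characters is one word); the function name simple_words is hypothetical and this is not ocrodjvu's actual code.

```python
import re

def simple_words(line):
    """Return (offset, word) pairs; a word is any run of non-whitespace."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+", line)]

if __name__ == "__main__":
    for offset, word in simple_words("Hello, world!  foo"):
        print(offset, word)
```

Note that punctuation stays attached to the neighbouring word ("Hello," and "world!"), which is why this rule is linguistically incorrect yet compatible with tools that expect space-separated words.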
Other options
--clear-text
Remove existing hidden text, if any, from the pages not selected
for OCR.
(Use this option to retain compatibility with ocrodjvu < 0.2.)
--ocr-only
Don't save pages that were not processed.
--language=language-id
Set recognition language. language-id is typically an ISO 639-2 three-letter code.
The default is 'eng' (English), unless the tesslanguage environment variable is set.
--list-languages
Print list of available languages.
-p, --pages=page-range
Specify the pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.
The default is to process all pages.
-j, --jobs=n
Start up to n OCR processes.
-D, --debug
To ease debugging, don't delete intermediate files.
--version
Output version information and exit.
-h, --help
Display help and exit.
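The page-range syntax accepted by -p, --pages (a comma-separated list of single pages and contiguous ranges, numbered from 1) can be illustrated with a short parser. This is a sketch only: the function name parse_page_range is hypothetical, and its handling of surrounding whitespace, duplicate pages and reversed ranges is an assumption, since the text above does not say how ocrodjvu treats those cases.

```python
def parse_page_range(spec):
    """Expand a string such as '17,37-42' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            first, last = int(first), int(last)
        else:
            first = last = int(part)
        # Assumption: pages start at 1 and reversed ranges are rejected.
        if first < 1 or last < first:
            raise ValueError("invalid sub-range: %r" % part)
        pages.update(range(first, last + 1))
    return sorted(pages)

if __name__ == "__main__":
    print(parse_page_range("17,37-42"))
```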

ENVIRONMENT

The following environment variable affects ocrodjvu:

tesslanguage
Recognition language for Tesseract.
(Use of this variable is deprecated in favor of the --language option.)

SEE ALSO

djvu(1), ocroscript(1), tesseract(1)

AUTHOR

Jakub Wilk <ubanus@users.sf.net>
Author.

COPYRIGHT

Copyright (C) 2008, 2009, 2010 Jakub Wilk

NOTES

1. OCRopus
http://ocropus.googlecode.com/
2. Cuneiform for Linux
http://launchpad.net/cuneiform-linux
3. Unicode Text Segmentation
http://unicode.org/reports/tr29/