WORDLIST2DAWG(1)

NAME

wordlist2dawg - convert a wordlist to a DAWG for tesseract

SYNOPSIS

Part of the process to train tesseract for a new language. Tesseract
uses 3 dictionary files for each language. Two of the files are coded
as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8
text file. To make the DAWG dictionary files, you first need a wordlist
for your language. The wordlist is formatted as a UTF-8 text file with
one word per line. Split the wordlist into two sets: the frequent
words, and the rest of the words, and then use wordlist2dawg to make
the DAWG files:

wordlist2dawg frequent_words_list freq-dawg

wordlist2dawg words_list word-dawg
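
The split into frequent words and the rest is not performed by wordlist2dawg itself. One possible way to prepare the two input files is sketched below in Python. This is only an illustration: the function names, the cutoff top_n, and the idea of ranking words by how often they occur in a tokenized text corpus are assumptions, not part of tesseract.

```python
from collections import Counter
from pathlib import Path


def split_wordlist(words, top_n):
    """Split an iterable of words into (frequent, rest).

    Hypothetical helper: the top_n most common distinct words form the
    frequent set; every other distinct word goes into the rest.
    """
    counts = Counter(words)
    # Rank by descending count; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:top_n], sorted(ranked[top_n:])


def write_wordlist(path, words):
    """Write one word per line as UTF-8, the format described above."""
    text = "".join(word + "\n" for word in words)
    Path(path).write_text(text, encoding="utf-8")
```

With a list of corpus tokens in hand, calling split_wordlist(tokens, top_n=1000) and passing the two results to write_wordlist("frequent_words_list", ...) and write_wordlist("words_list", ...) would produce the two files used in the commands above. The cutoff of 1000 is arbitrary and would need tuning for a real language.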

DESCRIPTION

This manual page briefly documents the wordlist2dawg command.

tesseract is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top three evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.

AUTHOR

tesseract was written by Ray Smith.

This manual page was written by Jeffrey Ratcliffe <Jeffrey.Ratcliffe@gmail.com>, for the Debian project (but may be used by others).
