clara(1)
NAME
clara - a cooperative OCR
SYNOPSIS
clara [options]
DESCRIPTION
Welcome. Clara OCR is a free OCR, written for systems supporting the C
library and the X Window System. Clara OCR is intended for the
cooperative OCR of books. Some screenshots are available at
http://www.claraocr.org/.
This documentation is extracted automatically from the comments of the
Clara OCR source code. It is known as "The Clara OCR Tutorial". There
is also an advanced manual known as "The Clara OCR Advanced User's Manual" (man page clara-adv(1), also available in HTML format). Developers
must read "The Clara OCR Developer's Guide" (man page clara-dev(1),
also available in HTML format).
CONTENTS
So let's try it. Of course we need a scanned page to do so. Clara OCR
requires input in PBM or PGM format (TIFF and other formats must be
converted first; the netpbm package contains various conversion tools).
The Clara distribution package contains one small PBM file that you can
use for a first test. The name of this file is imre.pbm. If you cannot
locate it, download it or other files from http://www.claraocr.org/.
Alternatively, you can produce your own 600-dpi PBM or PGM files by
scanning any printed document (hints for scanning pages and converting
them to PBM are given in the section "Scanning books" of the Clara OCR
Advanced User's Manual).
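Before loading a page, it can help to confirm that a file really is PBM
or PGM. The following Python sketch checks the magic number at the start
of the file, using the standard Netpbm values (P1/P4 for PBM, P2/P5 for
PGM). The function names are illustrative and not part of Clara OCR, and
this tutorial does not say whether Clara OCR accepts both the plain and
the raw variants.

```python
# Sketch only: classify a file as PBM or PGM from its first two bytes.
# Netpbm magic numbers: P1 (plain PBM), P4 (raw PBM),
#                       P2 (plain PGM), P5 (raw PGM).
NETPBM_MAGIC = {b"P1": "pbm", b"P4": "pbm", b"P2": "pgm", b"P5": "pgm"}


def netpbm_kind(data):
    """Return 'pbm', 'pgm', or None for any other content (e.g. TIFF)."""
    return NETPBM_MAGIC.get(bytes(data[:2]))


def file_kind(path):
    """Hypothetical helper: apply netpbm_kind() to a file on disk."""
    with open(path, "rb") as f:
        return netpbm_kind(f.read(2))
```

A file for which file_kind() returns None would need to be converted
(for instance with the netpbm tools) before Clara OCR can load it.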
Yes, Clara OCR must be trained. Training is a tedious procedure, but
it's a must for those who need a customizable OCR, able to adapt to a
perhaps uncommon printing style.
Before training, a process called segmentation must be performed. Press
the right mouse button over the OCR button, select "Segmentation" on
the menu that pops up, and wait for the operation to complete.
Now, on the "page" tab, observe the image of the document presented in
the top window. The symbols appear greyed because the OCR does not yet
know their transliterations. Try to select one symbol using the mouse
(click mouse button 1 over it). A black elliptic cursor will appear
around that symbol. This cursor is called the "graphic cursor". You can
move the graphic cursor around the document using the arrow keys.
Now observe the bottom window on the "page" tab. That window presents
some detailed information on the current symbol (the one identified by
the graphic cursor). When the "show web clip" option on the "View" menu
is selected, a clip of the document around the current symbol is
displayed too. In some cases, this clip is useful for better
visualization. It is called a "web clip" because this same image is
exported to the Clara OCR web interface when cooperative training and
revision through the Internet are being performed.
To inform the OCR about the transliteration of one symbol, just type
the corresponding key. For instance, if the current symbol is a letter
"a", just type the "a" key. Observe that the trained symbol becomes
black. Each trained symbol is learned by the OCR: its bitmap is called
a "pattern", and it is used as such when trying to deduce the
transliteration of unknown symbols.
Remark: in our test, the user chose the symbol to be trained. However,
Clara OCR can choose by itself the symbols to be trained. This feature
is called "build the bookfont automatically" (found on the "tune" tab).
To use it, select the corresponding checkbox and classify the symbols
as explained later.
Before going further, it's important to know how to save your work. The
"File" menu contains one item labelled "save session". When selected, it
will create or overwrite three files in the working directory:
"patterns", "acts" and "page.session", where "page" is the name of the
file currently loaded, without the "pbm" or "pgm" suffix (in our
example, "imre"). So, to remove all data produced by OCR sessions,
manually remove the files "*.session", "patterns" and "acts".
Note that the files "patterns" and "acts" are shared by all PBM or PGM
pages in the directory, so a symbol trained on one page is reused on
the other pages. The ".session" files, however, are per-page. Pages
with the same graphic characteristics, and only those, must be placed
in the same directory in order to share the same patterns.
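The file naming described above can be sketched in Python as follows.
The helper name session_files is hypothetical and only mirrors the rule
stated in the text (two shared files plus one ".session" file per
page); it does not read or write anything.

```python
import os


def session_files(page_filename):
    """List the files that "save session" creates or overwrites for a page.

    Sketch based on the tutorial text: "patterns" and "acts" are shared
    by the whole directory, and "<page>.session" is specific to the page.
    """
    base, ext = os.path.splitext(os.path.basename(page_filename))
    if ext.lower() not in (".pbm", ".pgm"):
        raise ValueError("expected a .pbm or .pgm page, got %r" % page_filename)
    return ["patterns", "acts", base + ".session"]
```

For the example page, session_files("imre.pbm") lists "patterns",
"acts" and "imre.session", which are the files to delete in order to
start over.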
The OCR process is divided into various steps, for instance
"classification", "build", etc. These steps are accessible by clicking
mouse button 3 over the OCR button. Each one can be started
independently and/or repeated at any moment. In fact, the more you know
about these steps, the better you'll use them.
Clicking the "OCR" button with mouse button 1 starts all steps in
sequence. The "OCR" button remains in the "selected" state while any
step is running.
After training some symbols, we're ready to apply the just acquired
knowledge to deduce the transliteration of non-trained symbols. For
that, Clara OCR will compare the non-trained symbols with the trained
ones ("patterns"). Clara OCR offers nice visual modes to present the
comparison of each symbol with each pattern. To activate the visual
modes, enter the "View" menu and select (for instance) the "show
comparisons" option.
Now start the "classification" step (click mouse button 3 over the
OCR button and select the "classification" item) and observe what
happens. Depending on your hardware and on the size of the document,
this operation may take a long time to complete (e.g. 5 minutes).
Hopefully it'll be much faster (say, 30 seconds).
When the classification finishes, observe that some non-trained symbols
became black. Each such symbol was found similar to some pattern.
Select one black symbol, and Clara will draw a gray ellipse around each
member of its class (except the selected symbol, identified by the
black graphic cursor). You can switch off this feature by deselecting
the "Show current class" item on the "View" menu.
The usual meaning of "classification" for OCRs is to deduce, for each
symbol, whether it is a letter "a", a letter "b", a digit "1", etc. As
the total number of different symbols is small (some tens), there
will be a small number of classes.
However, instead of classifying each symbol as being the letter "a", or
the digit "1", or whatever, Clara OCR builds classes of symbols with
similar shapes, not necessarily assigning a transliteration to each
symbol. Since the bitmap comparison heuristics sometimes consider two
true letters "a" dissimilar (due to printing differences or defects),
the Clara OCR classifier will break the set of all letters "a" into
various untransliterated subclasses.
Therefore, the classification result may be a much larger number of
classes (thousands or more), not only because of those small
differences or defects, but also because the classification heuristics
are currently unable to scale symbols or to "boldify" or "italicize" a
symbol.
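To illustrate why shape-based classes can outnumber the letters of the
alphabet, here is a toy Python sketch. It is not Clara OCR's actual
heuristic: it assumes equal-sized bitmaps given as rows of "0"/"1"
characters, measures similarity by counting differing pixels, and
groups greedily. The names hamming, build_classes and max_diff are
invented for this example.

```python
def hamming(a, b):
    """Count the pixels that differ between two equal-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def build_classes(bitmaps, max_diff):
    """Group bitmaps by shape without assigning any transliteration.

    Each bitmap joins the first class whose representative (its first
    member) differs in at most max_diff pixels; otherwise it starts a
    new class. Classes are lists of indices into `bitmaps`.
    """
    classes = []
    for i, bmp in enumerate(bitmaps):
        for cls in classes:
            if hamming(bitmaps[cls[0]], bmp) <= max_diff:
                cls.append(i)
                break
        else:
            classes.append([i])
    return classes


# Three renderings of the same letter; the third has a printing defect.
a1 = ("0110", "1001", "1111", "1001")
a2 = ("0110", "1001", "1111", "1011")
a3 = ("1111", "1001", "1111", "1111")
classes = build_classes([a1, a2, a3], max_diff=1)
```

With max_diff=1 the first two bitmaps share a class while the defective
one forms a class of its own, so a single letter already yields two
untransliterated subclasses.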
Now we're ready to build the OCR output. Just start the "build" step.
The action performed will be basically to detect text words and lines,
and output the transliterations, trained or deduced, of all symbols.
The output will be presented on the "PAGE (output)" window.
Each character on the "PAGE (output)" window behaves like an HTML
hyperlink. Click it to select the current symbol both on the "PAGE"
window and on the "PAGE (symbol)" window. Note that the transliteration
of unknown symbols is substituted by their internal IDs (for instance
"[133]").
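The substitution of internal IDs for unknown symbols can be pictured
with a short Python sketch. The data layout (a sequence of pairs of
symbol ID and transliteration, with None for unknown symbols) and the
name render_output are assumptions made for illustration; Clara OCR's
real build step also detects words and lines, which is omitted here.

```python
def render_output(symbols):
    """Join transliterations, writing "[id]" for symbols still unknown.

    symbols: iterable of (symbol_id, transliteration_or_None) pairs.
    """
    parts = []
    for symbol_id, transliteration in symbols:
        if transliteration is None:
            parts.append("[%d]" % symbol_id)  # unknown: show internal ID
        else:
            parts.append(transliteration)
    return "".join(parts)
```

For instance, the symbols (131, "c"), (132, "a"), (133, None) and
(134, "t") would be rendered as "ca[133]t".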
Remark: As of version 20031214, the merging heuristics are only
partially implemented, and in most cases they won't produce any effect.
Now let's talk about accents.
As explained earlier, trained symbols become patterns (unless you mark
them "bad"). The collection of all patterns is called the "book font"
(the term "book" distinguishes it from the GUI font). Clara OCR stores
all patterns in the "patterns" file in the working directory when the
"save session" entry on the "File" menu is selected.
Clara OCR itself can choose the patterns and populate the book font. To
do so, just select the "Build the font automatically" item on the
"tune" tab, and classify the symbols.
To browse the patterns, click the "pattern" tab one or more times to
enter the "PATTERN (list)" window. The "PATTERN (list)" mode displays
the bitmap and the properties of each pattern in a (perhaps very long)
form. Click the "zoom" button to adjust the size of the pattern
bitmaps. Use the scrollbar or the Next (Page Down) or Previous (Page
Up) keys to navigate. Use the sort options on the "Edit" menu to change
the presentation order.
If the GUI becomes garbled or blank, press C-l to redraw it.
For now, the GUI does not support cut-and-paste. To save the contents
of the "PAGE (list)" window to a file, use the "Write report" item on
the "File" menu.
The "OCR" button will enter the "pressed" state in some unexpected
situations, like during dialogs. This behaviour will be fixed soon.
The "STOP" button does not immediately stop the OCR operation in
progress (e.g. classification). Clara OCR only stops the operation at
"secure" points, where all data structures are consistent.
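Stopping only at "secure" points is a common cooperative cancellation
pattern. The Python sketch below shows the general idea under stated
assumptions: the stop request is a threading.Event, and the only safe
point is between two work items. The names run_step, process and
stop_requested are hypothetical and do not come from the Clara OCR
source.

```python
import threading


def run_step(items, process, stop_requested):
    """Process items one by one, honouring a stop request between items.

    The flag is checked only before starting an item, never in the
    middle of one, so each item is either fully processed or untouched.
    """
    done = []
    for item in items:
        if stop_requested.is_set():  # safe point: state is consistent
            break
        done.append(process(item))
    return done


stop = threading.Event()


def process(x):
    """Toy work unit that requests a stop while handling item 2."""
    if x == 2:
        stop.set()
    return x * 10


finished = run_step([1, 2, 3, 4], process, stop)
```

Here the stop is requested while item 2 is being handled, yet item 2
still completes; the loop ends before item 3 begins.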
The OCR output is automatically saved to the file page.html (or
page.txt if the option -o was used), where "page" is the name of the
currently loaded page, without the "pbm" or "pgm" suffix. This file is
created by the "generate output" step on the menu that appears when
mouse button 3 is pressed over the OCR button.
Clara OCR "fun codes" are similar to videogame "codes" (for those who
have never heard about that, videogame "codes" are special sequences of
mouse or key clicks that make your player invulnerable, or obtain
maximum energy, or perform an unexpected action, etc).
Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote
the internal preprocessor. Clara OCR includes bugfixes produced by
other developers. The Changelog (http://www.claraocr.org/CHANGELOG)
acknowledges all of them (see below). Imre Simon contributed high-volume
tests, discussions with experts, selection of bibliographic resources,
publicity and many ideas on how to make the software more useful.
Ricardo authored various free materials, some included (at least) in
Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue",
the ispell dictionary br.ispell and the proxy axw3). He recently ported
the EiC interpreter to the Psion 5 handheld and patched the Xt-based
vncviewer to scale framebuffers and compute image diffs. Ricardo works
as an independent developer and instructor. He received no financial
aid to develop Clara OCR. He's not an employee of any company or organization.
Imre Simon promotes the usage and development of free technologies and
information from his research, teaching and administrative labour at
the University.
Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on
character isolation and recognition. Richard Stallman suggested
improvements on how to generate HTML output. Marius Vollmer is helping
to add Guile support. Jacques Le Marois helped with the announcement
process.
We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley
Digital Library, and Wanderley Antonio Cavassin, Janos Simon and
Roberto Marcondes Cesar Junior for some web and bibliographic pointers.
Bruno Barbieri Gnecco provided hints and explanations about GOCR (main
author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is
kindly supporting our attempts to use portions of his code. Adriano
Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the
tutorial before the first announcement. Eduardo Marcel Macan packaged Clara
OCR for Debian and suggested some improvements. Mandrakesoft is hosting
claraocr.org. We acknowledge Conectiva and SuSE for providing copies of
their outstanding distributions. Finally, we acknowledge the late Jose
Hugo de Oliveira Bussab for his interest in our work.
Adriano Nagelschmidt Rodrigues donated a 15" monitor.
The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.
The other names cited by the CHANGELOG and not mentioned above (for
small patches, bug reports, specfiles, suggestions, explanations, etc.)
follow.
- Brian G. (win32), Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, Franz Bakan (OS/2), groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (NetBSD), Tim McNerney, Tyler Akins.