clara(1)

NAME

clara - a cooperative OCR

SYNOPSIS

clara [options]

DESCRIPTION

Welcome. Clara OCR is a free OCR program, written for systems supporting the C library and the X Window System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.

This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Tutorial". There is also an advanced manual known as "The Clara OCR Advanced User's Manual" (man page clara-adv(1), also available in HTML format). Developers must read "The Clara OCR Developer's Guide" (man page clara-dev(1), also available in HTML format).

So let's try it. Of course we need a scanned page to do so. Clara OCR requires the PBM or PGM graphic format (TIFF and other formats must be converted; the netpbm package contains various conversion tools). The Clara distribution package contains one small PBM file that you can use for a first test. The name of this file is imre.pbm. If you cannot locate it, download it or other files from http://www.claraocr.org/. Alternatively, you can produce your own 600-dpi PBM or PGM files by scanning any printed document (hints for scanning pages and converting them to PBM are given in the section "Scanning books" of the Clara OCR Advanced User's Manual).
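Before loading a page, it can be handy to confirm that the file really is PBM or PGM. The following is a minimal Python sketch, not part of Clara OCR; it relies only on the standard Netpbm magic numbers, and the names netpbm_kind and file_kind are hypothetical helpers introduced for illustration.

```python
# Netpbm magic numbers: P1/P4 are PBM (bitmap), P2/P5 are PGM (graymap).
# Anything else (for example P6, a colour PPM) would need conversion first.
NETPBM_KINDS = {
    b"P1": "pbm (plain)",
    b"P4": "pbm (raw)",
    b"P2": "pgm (plain)",
    b"P5": "pgm (raw)",
}


def netpbm_kind(data):
    """Return a description if data starts with a PBM/PGM magic number, else None."""
    return NETPBM_KINDS.get(bytes(data[:2]))


def file_kind(path):
    """Read the first two bytes of a file and classify them (hypothetical helper)."""
    with open(path, "rb") as handle:
        return netpbm_kind(handle.read(2))
```

For instance, calling file_kind("imre.pbm") on the test page from the distribution should report one of the two PBM variants; a None result suggests the file needs to be converted with the netpbm tools.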

Yes, Clara OCR must be trained. Training is a tedious procedure, but it's a must for those who need a customizable OCR, able to adapt to a perhaps uncommon printing style.

Before training, a process called segmentation must be performed. Press the right mouse button over the OCR button, select "Segmentation" on the menu that pops up, and wait for the operation to complete.

Now, on the "page" tab, observe the image of the document presented on the top window. You'll see the symbols greyed, because the OCR currently does not know their transliterations. Try to select one symbol using the mouse (click the mouse button 1 over it). A black elliptic cursor will appear around that symbol. This cursor is called the "graphic cursor". You can move the graphic cursor around the document using the arrow keys.

Now observe the bottom window on the "page" tab. That window presents some detailed information on the current symbol (the one identified by the graphic cursor). When the "show web clip" option on the "View" menu is selected, a clip of the document around the current symbol is displayed too. In some cases, this clip is useful for better visualization. It is called "web clip" because this same image is exported to the Clara OCR web interface when cooperative training and revision through the Internet is being performed.

To inform the OCR about the transliteration of one symbol, just type the corresponding key. For instance, if the current symbol is a letter "a", just type the "a" key. Observe that the trained symbol becomes black. Each symbol trained will be learned by the OCR, its bitmap will be called a "pattern", and it will be used as such when trying to deduce the transliteration of unknown symbols.

Remark: in our test, the user chose the symbol to be trained. However, Clara OCR can choose by itself the symbols to be trained. This feature is called "build the bookfont automatically" (found on the "tune" tab). To use it, select the corresponding checkbox and classify the symbols as explained later.

Before going further, it's important to know how to save your work. The file menu contains one item labelled "save session". When selected, it will create or overwrite three files in the working directory: "patterns", "acts" and "page.session", where "page" is the name of the file currently loaded, without the "pbm" or "pgm" tag (in our example, "imre"). So, to remove all data produced by OCR sessions, manually remove the files "*.session", "patterns" and "acts".

Note that the files "patterns" and "acts" are shared by all PBM or PGM pages, so a symbol trained from one page is reused on the other pages. The ".session" files, however, are per-page. Pages with the same graphic characteristics, and only those, must be put in the same directory in order to share the same patterns.
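The file naming described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not code from Clara OCR; session_file and files_to_remove are hypothetical helper names, and accepting upper-case suffixes is an assumption.

```python
import os

# "patterns" and "acts" are shared by every page in the working directory.
SHARED_FILES = ("patterns", "acts")


def session_file(page_path):
    """Map a page such as 'imre.pbm' to its per-page session file 'imre.session'."""
    root, ext = os.path.splitext(os.path.basename(page_path))
    if ext.lower() not in (".pbm", ".pgm"):  # assumption: suffix case is ignored
        raise ValueError("expected a .pbm or .pgm page: %r" % (page_path,))
    return root + ".session"


def files_to_remove(directory):
    """List the session data found in a directory, without deleting anything."""
    names = sorted(os.listdir(directory))
    found = [n for n in names if n.endswith(".session") or n in SHARED_FILES]
    return [os.path.join(directory, n) for n in found]
```

Listing the files first, rather than deleting them directly, gives a chance to review what would be lost, since removing "patterns" discards the training shared by all pages in the directory.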

The OCR process is divided into various steps, for instance "classification", "build", etc. These steps are accessible by clicking mouse button 3 over the OCR button. Each one can be started independently and/or repeated at any moment. In fact, the more you know about these steps, the better you'll use them.

Clicking the "OCR" button with mouse button 1 starts all steps in sequence. The "OCR" button remains in the "selected" state while a step is running.

After training some symbols, we're ready to apply the just acquired knowledge to deduce the transliteration of non-trained symbols. For that, Clara OCR will compare the non-trained symbols with those trained ("patterns"). Clara OCR offers nice visual modes to present the comparison of each symbol with each pattern. To activate the visual modes, enter the View menu and select (for instance) the "show comparisons" option.

Now start the "classification" step (click mouse button 3 over the OCR button and select the "classification" item) and observe what happens. Depending on your hardware and on the size of the document, this operation may take a long time to complete (e.g. 5 minutes). Hopefully it'll be much faster (say, 30 seconds).

When the classification finishes, observe that some non-trained symbols became black. Each such symbol was found similar to some pattern. Select one black symbol, and Clara will draw a gray ellipse around each class member (except the selected symbol, identified by the black graphic cursor). You can switch off this feature by deselecting the "Show current class" item on the "View" menu.

The usual meaning of "classification" for OCRs is to deduce for each symbol whether it is a letter "a" or a letter "b", or a digit "1", etc. As the total number of different symbols is small (some tens), there will be a small number of classes.

However, instead of classifying each symbol as being the letter "a", or the digit "1", or whatever, Clara OCR builds classes of symbols with similar shapes, not necessarily assigning a transliteration to each symbol. Since the bitmap comparison heuristics sometimes consider two true letters "a" dissimilar (due to printing differences or defects), the Clara OCR classifier will break the set of all letters "a" into various untransliterated subclasses.

Therefore, the classification result may be a much larger number of classes (thousands or more), not only because of those small differences or defects, but also because the classification heuristics are currently unable to scale symbols or to "boldfy" or "italicize" a symbol.
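To make the idea of shape-based classes concrete, here is a toy Python sketch. It is not the Clara OCR classifier: the pixel-matching measure, the greedy grouping and the 0.9 threshold are all assumptions chosen for illustration, and similarity and classify are hypothetical names. Bitmaps are represented as tuples of '0'/'1' strings.

```python
def similarity(a, b):
    """Fraction of matching pixels between two bitmaps of identical size."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        return 0.0  # no scaling is attempted: different sizes never match
    total = sum(len(row) for row in a)
    same = sum(p == q for x, y in zip(a, b) for p, q in zip(x, y))
    return same / total if total else 0.0


def classify(symbols, threshold=0.9):
    """Greedy grouping: a symbol joins the first class whose first member is similar enough."""
    classes = []  # each class is a list of symbol indices
    for index, bitmap in enumerate(symbols):
        for members in classes:
            if similarity(symbols[members[0]], bitmap) >= threshold:
                members.append(index)
                break
        else:
            classes.append([index])  # nothing similar: start a new class
    return classes


# Three renderings of the same glyph; the third has a printing defect.
clean = ("0110", "1001", "1111", "1001")
nearly = ("0110", "1001", "1111", "1000")
damaged = ("0000", "1001", "1111", "1000")
print(classify([clean, nearly, damaged]))  # [[0, 1], [2]]
```

The damaged rendering ends up in a class of its own even though it depicts the same letter, which is the effect described above: one letter may be spread over several untransliterated subclasses.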

Now we're ready to build the OCR output. Just start the "build" step. The action performed will be basically to detect text words and lines, and output the transliterations, trained or deduced, of all symbols. The output will be presented on the "PAGE (output)" window.

Each character on the "PAGE (output)" window behaves like an HTML hyperlink. Click it to select the current symbol both on the "PAGE" window and on the "PAGE (symbol)" window. Note that the transliteration of unknown symbols is substituted by their internal IDs (for instance "[133]").
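The substitution of internal IDs for unknown symbols can be pictured with a short Python sketch. The data layout (a list of pairs) and the name render_line are assumptions made for this illustration; the actual "build" step also detects words and lines, which is not modelled here.

```python
def render_line(symbols):
    """Join transliterations, writing "[id]" for symbols that have none.

    symbols is a list of (symbol_id, transliteration_or_None) pairs
    in reading order (a hypothetical layout used only for this sketch).
    """
    pieces = []
    for symbol_id, text in symbols:
        pieces.append(text if text is not None else "[%d]" % symbol_id)
    return "".join(pieces)


# Symbol 133 has neither a trained nor a deduced transliteration.
print(render_line([(131, "c"), (132, "a"), (133, None), (134, "t")]))  # ca[133]t
```

Seeing an ID such as "[133]" in the output is therefore a cue to select that symbol and train it, then repeat the classification and build steps.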

Remark: As of version 20031214 the merging heuristics are only partially implemented, and in most cases they won't produce any effect.

Now let's talk about accents.

As explained earlier, trained symbols become patterns (unless you mark them "bad"). The collection of all patterns is called the "book font" (the term "book" is to distinguish it from the GUI font). Clara OCR stores all patterns in the "patterns" file in the working directory when the "save session" entry on the "File" menu is selected.

Clara OCR itself can choose the patterns and populate the book font. To do so, just select the "Build the font automatically" item on the "tune" tab, and classify the symbols.

To browse the patterns, click the "pattern" tab one or more times to enter the "PATTERN (list)" window. The "PATTERN (list)" mode displays the bitmap and the properties of each pattern in a (perhaps very long) form. Click the "zoom" button to adjust the size of the pattern bitmaps. Use the scrollbar or the Next (Page Down) or Previous (Page Up) keys to navigate. Use the sort options on the "Edit" menu to change the presentation order.

If the GUI becomes trashed or blank, press C-l to redraw it.

For now, the GUI does not support cut-and-paste. To save the contents of the "PAGE (list)" window to a file, use the "Write report" item on the "File" menu.

The "OCR" button will enter the "pressed" state in some unexpected situations, such as during dialogs. This behaviour will be fixed soon.

The "STOP" button does not immediately stop the OCR operation in progress (e.g. classification). Clara OCR only stops the operation at "secure" points, where all data structures are consistent.

The OCR output is automatically saved to the file page.html (or page.txt if the option -o was used), where "page" is the name of the currently loaded page, without the "pbm" or "pgm" tag. This file is created by the "generate output" step on the menu that appears when the mouse button 3 is pressed over the OCR button.

Clara OCR "fun codes" are similar to videogame "codes" (for those who have never heard about that, videogame "codes" are special sequences of mouse or key clicks that make your player invulnerable, or obtain maximum energy, or perform an unexpected action, etc).

Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all of them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, propaganda and many ideas on how to make the software more useful.

Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization.

Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University.

Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped with the announcement process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly supporting our attempts to use portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announcement. Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work.

Adriano Nagelschmidt Rodrigues donated a 15" monitor.

The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.

The names cited by the CHANGELOG (and not cited before) follow (small patches, bug reports, specfiles, suggestions, explanations, etc).

Brian G. (win32), Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, Franz Bakan (OS/2), groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (NetBSD), Tim McNerney, Tyler Akins.
