clara(1)

NAME

clara - a cooperative OCR

SYNOPSIS

clara [options]

DESCRIPTION

Welcome. Clara OCR is a free OCR, written for systems supporting the C library and the X Window System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at http://www.claraocr.org/.

This documentation is extracted automatically from the comments of the Clara OCR source code. It is known as "The Clara OCR Advanced User's Manual". It's currently unfinished. First-time users are invited to read "The Clara OCR Tutorial". Developers must read "The Clara OCR Developer's Guide".

Clara is optical character recognition (OCR) software, a program that tries to identify the graphic images of the characters from a scanned document, converting their digital images to ASCII, ISO or other codes.

The name Clara stands for "Cooperative Lightweight chAracter Recognizer".

For some years now we have tested and used OCR software, mainly for old books. Popular OCR programs (those bundled with scanners) are useful tools. However, OCR is not a simple task. The results obtained using those programs vary largely depending on the printed document, and, for most texts we're interested in, the results are really poor or even unusable. In fact, it's not a surprise that many digitization projects prefer not to use OCR, but typists only.

For a programmer, it is somewhat intuitive that OCR could achieve good results even from low quality texts when an ad hoc approach is used, focusing on one specific book (for instance). Within this approach, OCR becomes a matter of finding software adequate for the texts you're trying to OCR, or perhaps developing a new one. So a free OCR that is easy to customize (at the source code level) would be a valuable resource for text digitization projects.

It's not a bad idea to enumerate some principles that have driven Clara OCR development. They'll make it easier to understand the features and limitations of the software (these principles may change over time).

1. Clara is an OCR for printed texts, not for handwritten texts.

2. Clara was not designed to OCR one or two single pages, but to OCR a large number of documents with the same graphic characteristics (font, size, etc). So it can take advantage of fine (and perhaps expensive) training. This will typically be the case when OCRing an entire book.

3. We chose not to support multiple graphic formats directly, but only Jef Poskanzer's raw PBM and PGM. Non-PBM/PGM files will be read through filters.

4. Clara OCR wants to be a tool that makes it viable to accumulate and reuse human revision effort. Because of this, in the OCR model implemented by Clara, training and revision are one and the same thing. The revision is a sum of small, independent acts and alternates with reprocessing steps along a refinement process.

5. The Clara GUI was implemented and behaves like a minimalistic HTML viewer. This is just an easy and standard way to implement a forms interface.

6. We have tried to make the source code portable across platforms that support the C library and Xlib. Clara has no special provision to be ported to environments that do not support Xlib. We avoided using a higher-level graphic environment like Motif, GTK or Qt, but we do not discourage initiatives to add code that adapts Clara OCR, or adapts it better, to these or other graphic environments.

Clara OCR focuses on the Latin alphabet ("a", "b", "c", ...), used by most European languages, and the decimal digits ("0", "1", "2", ...), but we're trying to support as many alphabets as possible.

To say that Clara OCR supports a given alphabet means that Clara OCR

(a) is able to be trained from the keyboard for the symbols of that alphabet, possibly applying some transliteration from that alphabet to Latin. For instance, when OCRing a Greek text, if the user presses the Latin "a" key (assuming that the keyboard has Latin labels), Clara is expected to train the current symbol as "alpha";

(b) knows the vertical alignment of each letter of that alphabet, for instance, knows that the bottom of an "e" is aligned at the baseline;

(d) contains code to help avoid common mistakes, like recognizing "e" as "c", "l" as "1", etc.

To say that Clara OCR supports a given alphabet does not necessarily mean that Clara OCR

(a) knows some particular encoding (ISO-8859-X, Unicode, etc) for that alphabet;

(b) contains or is able to use fonts for that alphabet to display the OCR output on the PAGE (OUTPUT) window.

Clara differs from other OCR software in various aspects:

1. Most known OCRs are non-free and Clara is free. Clara focuses on the X Window System. Clara offers batch processing, a web interface and supports cooperative revision effort.

2. Most OCR software focuses on omnifont technology, disregarding training. Clara does not implement omnifont techniques and concentrates on building specialized fonts (some day in the future, however, maybe we'll try classification techniques that do not require training).

3. Most OCR software makes the revision of the recognized text a process totally separate from the recognition. Clara pragmatically joins the two processes, and makes training and revision one and the same thing. In fact, the OCR model implemented by Clara is an interactive effort where the usage of the heuristics alternates with revision and visual fine-tuning of the OCR, guided by the user's experience and feeling.

4. Clara allows the transliteration of each pattern to be entered using an interface that displays a graphic cursor directly over the image of the scanned page, and builds and maintains a mapping between graphic symbols and their transliterations in the OCR output. This is a potentially useful mechanism for documentation systems, and a valuable tool for typists and reviewers. In fact, Clara OCR may be seen as a productivity tool for typists, instead of a typical OCR.

5. Most OCR software is integrated with scanning tools, offering the user a unified interface to execute all steps from scanning to recognition. Clara does not offer such an integrated interface, so you need separate software (e.g. SANE) to perform scanning.

Clara OCR will run on a PC (386, 486 or Pentium) with GNU/Linux and the X Window System. Clara OCR will hopefully compile and run on a PC with any Unix-like operating system and X. Currently Clara OCR won't run on big-endian CPUs (e.g. Sparc) nor on systems lacking X support (e.g. MS-Windows). Higher-level libraries like Motif, GTK or Qt are not required.

A relatively fast CPU is recommended (300MHz or more). Memory usage depends on the documents, and may range from a few megabytes to several tens of megabytes. Normal operation will create session files on your hard disk, so some megabytes of free disk space are required (a large project may require plenty of gigabytes). Clara OCR can read and write gzipped files (see the -z command-line switch).

For those who need to download and compile the source code (hopefully this will be unnecessary for most users as soon as Clara binary distributions become available), it may be downloaded from http://www.claraocr.org/. It's a compressed tar archive with a name like clara-x.y.tar.gz (x.y is the version number).

This subsection is intended to help people who are experiencing fatal errors when building the executable or when starting it. After each error message we'll point out some hints.

Bear in mind that most hints given below are very elementary concerning Unix-like systems. If you have problems, try to read all hints because details explained once are not repeated. If you cannot understand them, please try to ask your local experts, or try to read an introductory book on Unix things. Please don't email questions like these to the Clara developers, except when the hint suggests it.

Clara OCR is intended to OCR a relatively large collection of pages at once, typically a book. So we will refer to the material that we are OCRing as "the book".

Let's describe a small but real project as an example of how to use Clara to OCR one "book". This section is in fact an in-depth tutorial on using Clara OCR. In order to try all techniques explained in this section, please download and uncompress the file referred to as "page 143" of the Manuel Bernardes Branco Dictionary (Lisbon, 1879), available at http://www.claraocr.org. It's a tarball containing the two text columns (one per file) of that page.

Clara OCR cannot scan paper documents by itself. Scanning must be performed by another program. The Clara OCR development effort is using SANE (http://www.mostang.com/sane) to produce 600 or 300 dpi images. The Clara OCR heuristics are tuned to 600 dpi.

Scanners offer three scanning modes: black-and-white (also known as "bitmap" or "lineart", though the meaning of these words may vary depending on the context), "grayscale" and "color". Clara OCR requires black-and-white or grayscale input. Both black-and-white and grayscale images may be saved in a variety of formats by scanning programs. However, only PBM (for black-and-white) and PGM (for grayscale) formats are recognized. Generally grayscale 600 or 300 dpi will be the best choice, but black-and-white 600 dpi may be good for new, high quality printed materials. If your scanning program does not support the PBM or PGM formats, try to save the images in TIFF format and convert to PBM or PGM using the command tifftopnm. If for some reason the TIFF format cannot be used, choose any other format that preserves all data (don't use "compressing" formats like JPEG), and for which a conversion tool is available, to convert it to PBM or PGM.
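The TIFF conversion step can be scripted. The following is a minimal sketch, assuming the netpbm tifftopnm command is installed and on the PATH; it relies only on the fact that tifftopnm takes the TIFF file name as an argument and writes the converted image to standard output. The helper names (tifftopnm_command, convert_tiff, output_name) and the file naming scheme are ours, not part of Clara OCR.

```python
import subprocess
from pathlib import Path


def tifftopnm_command(tiff_path):
    """Build the argument list for netpbm's tifftopnm (output goes to stdout)."""
    return ["tifftopnm", str(tiff_path)]


def output_name(tiff_path, grayscale=True):
    """Pick a .pgm or .pbm name next to the TIFF (our own naming convention)."""
    return Path(tiff_path).with_suffix(".pgm" if grayscale else ".pbm")


def convert_tiff(tiff_path, out_path):
    """Run tifftopnm and store its standard output in out_path."""
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(tifftopnm_command(tiff_path), stdout=out, check=True)


# Example (requires tifftopnm and an existing 5.tif):
# convert_tiff("5.tif", output_name("5.tif"))
```

Note that the kind of image tifftopnm produces depends on the TIFF itself, so a color scan will not become a PGM just because the output file is named that way.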

Remark: Programs that scan or handle (e.g. rotate) images may sometimes perform unexpected tasks, such as applying dithering or reduction algorithms by themselves. An image transformed to become nice or small may be useless for OCR purposes.

Remark: The PBM and PGM formats do not carry the original resolution (dots-per-inch) at which the image was scanned. As some heuristics require that information, Clara OCR expects to be informed about it through the command-line switch -y (so take note of the resolution used).
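To see why the resolution has to be supplied separately, it helps to look at what a raw PBM or PGM header actually contains. The sketch below is our own illustration (it is not Clara OCR code): it parses the header of a raw PBM (magic number P4) or raw PGM (magic number P5) held in memory. The header carries only the magic number, width, height and, for PGM, the maximum gray value. A comment line may mention the resolution, but it is free text that programs ignore.

```python
def read_pnm_header(data):
    """Parse the header of a raw PBM (P4) or PGM (P5) image given as bytes."""
    tokens = []
    pos = 0
    wanted = None  # number of header tokens, known once the magic is read
    while wanted is None or len(tokens) < wanted:
        # Skip whitespace between header tokens.
        while data[pos:pos + 1].isspace():
            pos += 1
        # A '#' starts a comment that runs to the end of the line.
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
        if wanted is None:
            # P4 has magic, width, height; P5 adds maxval.
            # Any other magic number raises KeyError here.
            wanted = {b"P4": 3, b"P5": 4}[tokens[0]]
    return {
        "magic": tokens[0].decode("ascii"),
        "width": int(tokens[1]),
        "height": int(tokens[2]),
        "maxval": int(tokens[3]) if wanted == 4 else None,
        # Exactly one whitespace byte separates the header from the raster.
        "raster_offset": pos + 1,
    }


sample = b"P5\n# scanned at 600 dpi\n4 2\n255\n" + bytes(8)
header = read_pnm_header(sample)
print(header["width"], header["height"], header["maxval"])
```

Nothing in the returned fields says how many dots per inch the scan used, which is why it must be passed through -y.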

Grayscale means that each pixel assumes one gray "level", typically from 0 (black) to 255 (white). This is a good choice for scanning old or low-quality printed materials, because it's possible to use specialized programs to analyse the image and choose a "threshold", in such a way that all pixels above that threshold will be considered "white", and all others will be considered black (when scanning in black-and-white mode, the threshold is chosen by the scanning program or by the user). The threshold may be global (fixed for the entire page) or local (vary along the page).
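As a small illustration of a global threshold, the sketch below turns rows of gray levels into rows of bits. It follows the rule stated above (pixels above the threshold become white, all others black) and uses the PBM convention of 1 for black. The function name and the sample values are made up for the example.

```python
def binarize(gray_rows, threshold):
    """Return PBM-style rows: 1 for black, 0 for white."""
    return [[0 if pixel > threshold else 1 for pixel in row]
            for row in gray_rows]


page = [
    [250, 240,  30, 245],
    [248,  90,  25, 250],
    [251, 243,  40, 120],
]

# Draw black pixels as '#' and white pixels as '.'.
for row in binarize(page, 128):
    print("".join("#" if bit else "." for bit in row))
```

Lowering the threshold to 100 would turn the faint pixel with level 120 white, which is the kind of trade-off the threshold choice controls.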

In most cases grayscale will achieve better results. However, as grayscale images are much larger than black-and-white images, 300 dpi (instead of 600 dpi) may be mandatory when using grayscale due to disk consumption requirements.

Remark: Try to limit yourself to the optical resolution offered by the scanner. Most old scanners are 300 dpi, but the scanning software obtains higher resolutions through interpolation. Newer scanners may be optical 600 dpi or 1200 dpi or more.

Histogram-based thresholding is the default method. It automatically computes a threshold value based on the distribution of gray shades. To use it, just enter the TUNE tab and select the "use histogram-based global thresholder" option (it's selected by default). To try it, load a PGM image and press OCR or request the Segmentation OCR step.

Remark: You can correct the automatically detected threshold with "Threshold factor" in the TUNE tab.
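This manual does not say which histogram-based algorithm Clara OCR uses. To give an idea of how a threshold can be derived from the distribution of gray levels, here is a sketch of Otsu's method, a well-known technique that picks the level maximizing the separation between the dark and the light class. It is offered as an example of the general approach, not as a description of Clara's code.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that best separates pixels <= t from the rest."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class (levels <= t)
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Ink around levels 10-12, paper around levels 200-210.
sample = [10] * 50 + [12] * 50 + [200] * 50 + [210] * 50
t = otsu_threshold(sample)
```

Any value of t between the ink and the paper levels separates the two groups equally well; the function returns the lowest such level.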

Global thresholding does not address those cases where the printing intensity (or paper properties) varies along one same page. Local thresholding methods are required in such cases. Clara OCR implements a classification-based local (per-symbol) thresholder. Saying that it's classification-based means that the OCR engine is used to choose the threshold. In other words, the threshold chosen is that for which the classifier successfully recognized the symbol (in fact, this is a brute-force approach).
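The brute-force idea can be sketched in a few lines: try candidate thresholds on the gray patch around one symbol, and keep the first one for which the classifier recognizes the resulting bitmap. Everything below is hypothetical illustration: the function names, the list of candidate thresholds and the toy classifier (which knows a single 3x3 pattern) are assumptions, and the real thresholder is certainly more elaborate.

```python
def local_threshold(patch, classify, candidates=range(32, 256, 32)):
    """Return (threshold, label) for the first threshold the classifier accepts."""
    for t in candidates:
        # Pixels at or below the threshold are black (1), the rest white (0).
        bitmap = [[1 if p <= t else 0 for p in row] for row in patch]
        label = classify(bitmap)
        if label is not None:
            return t, label
    return None, None


# Toy stand-in for the OCR engine: it knows one pattern, a vertical bar.
BAR = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]


def toy_classify(bitmap):
    return "l" if bitmap == BAR else None


# A vertical stroke (levels 100-110) with a lighter smudge (180) beside it.
patch = [
    [220, 100, 220],
    [220, 110, 180],
    [220, 100, 220],
]
threshold, label = local_threshold(patch, toy_classify)
```

With an untrained classifier (one that always returns None) no threshold is ever accepted, which matches the remark that this thresholder gives no benefit when the OCR is not trained.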

The local binarizer can be manually applied at any symbol. To do so, load one PGM page and click any symbol directly on the PAGE tab. Two thresholding values will be chosen. The pixels found to be "black" for each one are painted "black" (smaller value) and "gray" (larger value). At this moment, it's possible to add the thresholded symbol as a pattern (just press the key corresponding to its transliteration). Remember that this thresholder relies on the classifier, so if the OCR is not trained, you'll get no benefit.

Two versions of the local binarizer were developed, a "weak" one and a "strong" one. The "weak" one just tries to change the threshold on those symbols not successfully classified using the default threshold. The "strong" one (unfinished) also tries to criticize locally the segmentation results. By default, the weak version is used. To try the strong one, check the corresponding checkbox at the TUNE tab.

Sometimes the printing is skewed relative to the paper margins. Skew is a problem to the OCR heuristics. As the Clara OCR engine just detects components by pixel contiguity and builds classes of symbols, in practice the effect of skew will be a larger number of patterns, and therefore a larger revision cost.
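Detecting components by pixel contiguity can be sketched as a flood fill over the black pixels of the page. The code below is our own illustration, not Clara's implementation; in particular, treating diagonal neighbours as contiguous (8-connectivity) is an assumption.

```python
from collections import deque


def connected_components(bitmap):
    """Group black pixels (value 1) into components of contiguous pixels."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting at an unvisited black pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            components.append(pixels)
    return components


page = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
sizes = [len(c) for c in connected_components(page)]
```

Because each component is compared as a bitmap, the same letter printed at a slightly different angle yields a different bitmap, which is how skew inflates the number of patterns.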

In some cases, a careful manual scanning can solve the problem. When acceptable, a set-square solves the problem: just align one text line at one set-square rule and the edge of the scanner glass at the other rule (we're supposing that the bookbinding was disassembled).

Patterns are selected symbols from the book. They're obtained from manual training, or from automatic selection. The patterns are used to deduce the transliteration of the unknown symbols by the bitmap comparison heuristics. In other words, the OCR discovers that one symbol is the letter "a" or the digit "1" by comparing it with the patterns.
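A very simplified picture of bitmap comparison is to count the pixels in which a symbol and a pattern differ, and to accept the closest pattern if the difference is small enough. The sketch below does exactly that. It is a stand-in chosen for brevity: the function names, the tolerance of two pixels and the requirement that bitmaps have equal size are all assumptions, and Clara's own classifiers are more sophisticated.

```python
def pixel_difference(a, b):
    """Count the positions where two equally sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(symbol, patterns, max_difference=2):
    """Return the transliteration of the closest pattern, or None if unknown."""
    best_label, best_diff = None, max_difference + 1
    for label, bitmap in patterns:
        # Only compare bitmaps with the same dimensions.
        if len(bitmap) != len(symbol) or len(bitmap[0]) != len(symbol[0]):
            continue
        diff = pixel_difference(symbol, bitmap)
        if diff < best_diff:
            best_label, best_diff = label, diff
    return best_label


patterns = [
    ("l", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("-", [[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
]
noisy_l = [[0, 1, 0], [0, 1, 1], [0, 1, 0]]   # one stray pixel
blob = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]      # resembles no pattern
print(classify(noisy_l, patterns), classify(blob, patterns))
```

A symbol that matches no pattern stays unknown, and training it adds a new pattern to the book font.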

The book font is the collection of all patterns. The term "book font" was chosen to make sure that we're not talking about the X font used by the GUI. The book font is stored in a separate file ("patterns", in the work directory). Clara OCR classifies the patterns into "types", one type for each printing font. For now, most of this work must be done manually. Someday in the future, the auto-tuning features and the prebuilt customizations will hopefully make this process less painful.

Currently, symbol classification can be performed by three different classifiers: skeleton fitting, border mapping or pixel distance. The choice is made on the TUNE tab. Border mapping is currently experimental. Pixel distance has been used as an auxiliary classifier. Skeleton fitting is more mature code and is highly customizable. It's the default classification method for now.

To classify the book symbols (i.e. to discover the transliteration of unknown symbols using the patterns), enter Clara OCR, select "Work on all pages" ("Options" menu) and press the OCR button using the mouse button 1, or press the mouse button 3 and select "Classification". The classification may be performed many times. Each time, different parameters may be tried to refine the results already achieved.

When the classification finishes, observe the pages 5.pbm and 6.pbm. Most probably, some symbols will be greyed. In other words, the classifier was unable to classify all symbols. The statistics presented on the PAGE (LIST) tab may be useful now. To reduce the number of unknown symbols there are three choices: add more patterns, change the skeleton computation parameters, or try another classifier.

To add more patterns, just train some greyed symbols and reclassify all pages again. The reclassification will be faster than the first classification because most symbols, already classified, won't be touched.

To change the skeleton computation parameters, exit Clara OCR, restart it passing the new parameters through -k, select "Re-scan all patterns" ("Edit" menu), select "Work on all pages" ("Options" menu) and reclassify. It may be easier to choose and set the new parameters using the TUNE (SKEL) tab, as explained earlier. However, remember that the parameters chosen through the TUNE (SKEL) tab override the parameters passed through -k.

To try another classifier, first select the "Re-scan all patterns" entry on the "Edit" menu. Then enter the TUNE tab and select the classifier to use from the available choices (skeleton-based, border mapping and pixel distance). The pixel distance may be a good choice. Then reclassify all pages.

At this point, we can generate the output for all pages. The output is already available if the classification was performed clicking the OCR button with mouse button 1. If not, just select the "Work on all pages" item on the "Options" menu, and click the OCR button using the mouse button 1. The per-page output will be saved to the files 5.html and 6.html.

Maybe the output will contain unknown symbols. Maybe the output presents broken lines or broken words. If so, the numbers used to perform symbol alignment must be changed. These numbers are configured on the TUNE tab ("Magic numbers" section). They're part of the session data, so they'll be saved to disk.

To OCR an entire book is a long process. Perhaps a problem is detected along the way: a bad choice of skeleton computation parameters, a bad page contaminating the book font, some files lost due to a crash, etc. How to solve them?

3.5 Removing a page
From the stats presented by the PAGE (LIST) tab it's possible to detect problems on specific pages. A low factorization may be a symptom of a bad choice of brightness for that page. In such a case, it's probably a good idea to remove that page completely.

3.8 Importing revision data
When OCRing a large book, a good approach is to divide its pages into a number of smaller sections and OCR each one. So for a book with, say, 1000 pages, we could OCR pages 1-200, then 201-400, etc.

After finishing the first section, of course we want to reuse on the second section the training and revision effort already spent. This is not the same as adding the pages 201-400 to the first section, because we do not want to handle the pages 1-200 anymore.

Types of revision acts (to be written).

The "page (list)" tab offers recognition statistics on a per-page basis. The contents of each column on this tab are described below:

POS: The sequential position on the list. The current page is marked by an asterisk on this column.

FILE: The name of the file that contains the PBM image of the document.

RUNS: The number of OCR runs on this page. Partial OCR runs, like classification (started by the "classify" button), also count as one run.

TIME: Total CPU time spent on OCR operations on this page. I/O time (reading and saving session files) is not included.

WORDS: Current number of words on this page. This variable is updated by the "build" step.

SYMBOLS: Current number of symbols on this page. This variable is updated by the "build" step.

DOUBTS: Current number of untransliterated CHAR symbols on this page. This variable is updated by the "build" step.

CLASSES: Current number of classes on this page.

FACT: Quotient between the number of symbols and the number of classes.

RECOG: Quotient between (symbols-doubts) and symbols, where "symbols" is the number of symbols and "doubts" is the number of doubts as defined above.
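The two quotients are easy to reproduce by hand. The sketch below computes FACT and RECOG from the SYMBOLS, DOUBTS and CLASSES columns as defined above. The function name is ours, and returning 0.0 for an empty page is our own choice, since the definitions above do not cover that case.

```python
def page_stats(symbols, doubts, classes):
    """Return (FACT, RECOG) for one page."""
    # FACT: how many symbols, on average, share one class.
    fact = symbols / classes if classes else 0.0
    # RECOG: fraction of symbols that already have a transliteration.
    recog = (symbols - doubts) / symbols if symbols else 0.0
    return fact, recog


fact, recog = page_stats(symbols=1200, doubts=60, classes=150)
print(f"FACT={fact:.2f} RECOG={recog:.2f}")
```

For a page with 1200 symbols, 60 doubts and 150 classes this gives a FACT of 8 (each class stands for eight symbols on average) and a RECOG of 0.95.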

The application window is divided into three major areas: the buttons ("zoom", "OCR", "stop", etc), the "plate" (right), including the tabs ("page", "symbol" and "font"), and one or more "document windows" inside the plate.

We say "document window" because each window is exhibiting one "document". This "document" may be the scanned page (PAGE window), the current OCR output for this page (PAGE OUTPUT window), the symbol form (PAGE SYMBOL window), the GPL (GPL window) and so on. However, we'll refer to the document windows merely as "windows".

Three tabs are offered, and each one may operate in one or more "modes". For instance, pressing the PATTERN tab many times will cycle through two modes: one presenting the windows "pattern" and "pattern (props)" and another with the window "pattern (list)".

The application buttons are those displayed on the left portion of the Clara X window. They're labelled "zoom", "OCR", etc. Three types of buttons are available. There are on/off buttons (like "italic"), multistate buttons (like the alphabet button), where the state is indicated by the current label, and there are buttons that merely capture mouse clicks, like the "zoom" button. Some are sensitive both to mouse button 1 and to mouse button 3, others are sensitive only to mouse button 1.

zoom - enlarge or reduce bitmaps. Mouse button 1 enlarges bitmaps, mouse button 3 reduces them. The bitmaps to enlarge or reduce are determined by the current window. If the PAGE window is active, then the scanned document is enlarged or reduced. If the PAGE (fatbits) or the PATTERN window is active, then the grid is enlarged or reduced. If the PAGE (symbol) or the PATTERN (props) or the PATTERN (list) window is active, then the web clip is enlarged or reduced.

OCR - start a full OCR run on the current page or on all pages, depending on the state of the "Work on current page only" item of the Options menu.

stop - stop the current OCR run (if any). OCR does not stop immediately, but will stop as soon as possible.

zone - start definition of the OCR zone. Currently zoning in Clara OCR is useful only for saving the zone as a PBM file, using the "save zone" item on the "File" menu. For now, only one zone can be defined and the OCR operations consider the entire document, ignoring the zone.

type - read-only button, set according to the pattern type of the current symbol or pattern. The various letter sizes or styles (normal, footnote, etc) used by the book are numbered from 0 by Clara OCR ("type 0", "type 1", etc).

bad - toggles the button state. The bad flag is used to identify damaged bitmaps.

When the "Show alphabet map" option of the "View" menu is selected, the GUI will include an alphabet map between the buttons and the plate. This map presents all symbols from the current alphabet. The current alphabet is selected using the alphabet button. The alphabet button cycles through all alphabets selected on the "Alphabets" menu.

This menu is activated from the menu bar on the top of the application X window.

This item selects the alphabets that will be available on the alphabets button.

Arabic - This is a provision for future support of the Arabic alphabet.

This menu is activated when the mouse button 3 is pressed on the PAGE window.

This menu is activated when the mouse button 3 is pressed on the PAGE.

This menu is activated when the mouse button 3 is pressed on the OCR button. It allows running specific OCR steps (all steps run in sequence when the OCR button is pressed).

Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote the internal preprocessor. Clara OCR includes bugfixes produced by other developers. The Changelog (http://www.claraocr.org/CHANGELOG) acknowledges all of them (see below). Imre Simon contributed high-volume tests, discussions with experts, selection of bibliographic resources, propaganda and many ideas on how to make the software more useful.

Ricardo authored various free materials, some included (at least) in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue", the ispell dictionary br.ispell and the proxy axw3). He recently ported the EiC interpreter to the Psion 5 handheld and patched the Xt-based vncviewer to scale framebuffers and compute image diffs. Ricardo works as an independent developer and instructor. He received no financial aid to develop Clara OCR. He's not an employee of any company or organization.

Imre Simon promotes the usage and development of free technologies and information from his research, teaching and administrative labour at the University.

Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on character isolation and recognition. Richard Stallman suggested improvements on how to generate HTML output. Marius Vollmer is helping to add Guile support. Jacques Le Marois helped with the announcement process. We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley Digital Library, and Wanderley Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior for some web and bibliographic pointers. Bruno Barbieri Gnecco provided hints and explanations about GOCR (main author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly supporting our attempts to use portions of his code. Adriano Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the tutorial before the first announcement. Eduardo Marcel Macan packaged Clara OCR for Debian and suggested some improvements. Mandrakesoft is hosting claraocr.org. We acknowledge Conectiva and SuSE for providing copies of their outstanding distributions. Finally, we acknowledge the late Jose Hugo de Oliveira Bussab for his interest in our work.

Adriano Nagelschmidt Rodrigues donated a 15" monitor.

The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.

The names cited by the CHANGELOG (and not cited before) follow (small patches, bug reports, specfiles, suggestions, explanations, etc).

Brian G. (win32), Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, Franz Bakan (OS/2), groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (NetBSD), Tim McNerney, Tyler Akins.
