clara(1)
NAME
clara - a cooperative OCR
SYNOPSIS
clara [options]
DESCRIPTION
Welcome. Clara OCR is a free OCR program, written for systems supporting the C
library and the X Window System. Clara OCR is intended for the cooperative OCR of books. There are some screenshots available at
http://www.claraocr.org/.
This documentation is extracted automatically from the comments of the
Clara OCR source code. It is known as "The Clara OCR Developer's
Guide". It's currently unfinished. First-time users are invited to read
"The Clara OCR Tutorial". There is also an advanced manual known as
"The Clara OCR Advanced User's Manual".
CONTENTS
Clara OCR relies on the memory allocator both for the allocation and resizing of some large blocks used by the main data structures, and for
the allocation of a large number of small arrays. Currently Clara OCR does
not include or use a special memory allocator, but implements an
interface to realloc. The alloca function is also used in some places throughout
the code, generally to allocate buffers for sorting arrays.
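As an illustration of such an interface, here is a minimal sketch of a checked realloc wrapper; the name x_realloc and the error handling policy are assumptions made for this example, not the actual Clara OCR code.

    #include <stdio.h>
    #include <stdlib.h>

    /* hypothetical checked wrapper: grow or create a block, aborting
       on failure (realloc(NULL, n) behaves like malloc(n), so one
       wrapper covers both initial allocation and later resizes) */
    void *x_realloc(void *p, size_t nbytes)
    {
        void *q;

        q = realloc(p, nbytes);
        if (q == NULL && nbytes > 0) {
            fprintf(stderr, "memory exhausted (%lu bytes)\n",
                    (unsigned long) nbytes);
            exit(1);
        }
        return q;
    }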
Concerning security, the following criteria are used:
1. String operations are generally performed using services that accept
a size parameter, like snprintf or strncpy, except when the code itself
is simple and guarantees that an overflow won't occur.
2. The CGI clara.pl invokes write privileges through sclara, a program
specially written to perform only a small set of simple operations
required for the operation of the Clara OCR web interface.
The following should be done:
Naive support for runtime index checking is provided through the
macro checkidx. This checking is performed only if the code is compiled
with the macro MEMCHECK defined and the command-line switch
Clara OCR decides at runtime whether the GUI will be used. So even
when using Clara OCR in batch mode (-b command-line switch), linking
with the X libraries is required.
Clara OCR uses a lot of global variables. Large data structures, flags,
paths, etc. are stored in global variables. In some cases a naming strategy is used to make the code more readable. The important cases are:
In order to allow the GUI to refresh the application window while an
OCR run is in progress, Clara does not use multiple threads. The main
function alternates calls to xevents() to receive input and to continue_ocr() to perform OCR. As the OCR operations may take a long time to complete, a very simple model was implemented to allow the OCR services to
execute only partially.
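A minimal, self-contained sketch of this cooperative main loop follows. Only the names xevents() and continue_ocr() come from the text above; the stubs, the "ocring" flag and the loop shape are assumptions made for illustration.

    static int ocring = 1;            /* nonzero while an OCR run is active */

    static void xevents(void)         /* stub: would poll and handle X events */
    {
    }

    static void continue_ocr(void)    /* stub: would perform one small OCR slice */
    {
        static int steps = 0;

        if (++steps >= 3)
            ocring = 0;               /* pretend the OCR run finished */
    }

    int main(void)
    {
        /* the GUI stays responsive because each iteration returns quickly */
        while (ocring) {
            xevents();
            continue_ocr();
        }
        return 0;
    }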
Even for non-developers, some knowledge of the internal data structures
used by Clara OCR is required for fine tuning and for making simple
diagnostics.
As one character of the document may be composed of two or more closures (for instance when it's broken), it's convenient to work not with
closures, but with sets of closures. So we define the concept of "symbol" as a set of one or more closures. Initially, the OCR generates one unitary symbol for each closure. Subsequent steps may define
new symbols composed of two or more closures.
Each symbol is stored in an sdesc structure. Those structures form the
mc array. Once created, a symbol is never deleted, so its index in the
mc array identifies it (this is important for the web-based revision
procedure). Note that closures and symbols are numbered on a document-related basis. The set of closures that define one symbol never
changes, so the symbol's bounding box and total number of black pixels won't change either.
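As a rough illustration of the idea (the field names below are hypothetical and do not reflect the real sdesc layout), a symbol descriptor along these lines could look like:

    /* illustrative symbol descriptor; field names are hypothetical */
    typedef struct {
        int *cl;        /* indexes of the closures that form the symbol */
        int ncl;        /* number of closures (fixed once the symbol is created) */
        int l, r, t, b; /* bounding box: left, right, top, bottom */
        int nbp;        /* total number of black pixels */
    } symbol_sketch;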
The same closure may belong to more than one symbol. This is important
in order to allow various heuristic trials. For instance, the left closure of the "u" in the preceding section could be identified as the
body of an "i". In this case, however, we would not find its dot. So the
heuristic could then try another solution, for instance joining it
with the nearest closure (in that case, the right closure of the "u")
and trying to match it with some pattern of the font.
The font size is important for classifying all book symbols into pattern
"types". For instance, books generally use smaller letters for
footnotes. This classification is performed automatically by Clara OCR
and presented in the "PATTERN (types)" window.
The vertical alignment of symbols is important for various heuristics.
For instance, the vertical line from a broken "p" matches an "l", but
using alignment tests we're able to refuse this match.
Clara OCR applies the concept of "symbol" to atomic symbols like letters, digits or punctuation signs. Words (such as "house" or "peace") are
handled by Clara OCR as sequences of symbols.
It's very important to compute the words of the page. They provide
context both to the OCR and to the reviewer. For instance, if the known
symbols of some word were identified as bold, then Clara will automatically turn the bold button on when someone tries to review the unknown
symbols of that word. The same applies to preferring the recognition of a
symbol as the digit "1" instead of the letter "l" when the known symbols of the "word" are digits. Words are also the basis for revision
based on spelling. Each word is stored in a wdesc structure in the
"word" array.
ArrayThe "acts" or "revision acts" are the human interventions for training
a symbol, merging a fragment to one symbol, etc.
As the human interventions are the more precious source of information,
Clara logs all revision acts, in order to be able to reuse them.
The transliterations are obtained from the revision acts, so each
transliteration refers one (or more) revision acts, and also inherits
some properties from that act (or those acts).
The acts are on the book scope, and not on the page scope. The acts are
stored on the file "acts" on the work directory.
Clara OCR maintains a list of zero or more proposed or deduced transliterations for each symbol. Throughout the OCR process, each transliteration
receives "votes" from reviewers (REVISION votes) or from machine deduction heuristics, based on shape similarity (SHAPE votes) or on spelling
(SPELLING votes).
So the choice of the "best" transliteration is performed through an election. Votes are stored in structures of type vdesc, and transliterations are stored in structures of type trdesc. Each symbol stores a pointer to a (possibly empty) list of transliterations, and each transliteration stores a pointer to a (possibly empty) list of votes.
The election process used to choose the "best" transliteration for one
symbol (from those obtained through human revision or heuristics based
on shape similarity or spelling) consists of computing the "preference"
of each transliteration and choosing the one with maximum preference.
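The sketch below illustrates such an election. The names vdesc and trdesc come from the text, but the simplified structures and the preference rule (a plain vote count) are assumptions for this example; the real preference computation is more elaborate.

    #include <stddef.h>

    /* illustrative election: pick the transliteration with most votes */
    struct vote_sk   { int type; struct vote_sk *next; };    /* REVISION, SHAPE or SPELLING */
    struct transl_sk { char *t; struct vote_sk *votes; struct transl_sk *next; };

    struct transl_sk *elect_best(struct transl_sk *list)
    {
        struct transl_sk *tr, *best = NULL;
        int bestpref = -1;

        for (tr = list; tr != NULL; tr = tr->next) {
            int pref = 0;
            struct vote_sk *v;

            for (v = tr->votes; v != NULL; v = v->next)
                ++pref;                  /* could instead weight by v->type */
            if (pref > bestpref) {
                bestpref = pref;
                best = tr;
            }
        }
        return best;                     /* NULL when the list is empty */
    }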
Once we have computed the "best" transliteration, we can compute its
transliteration class, which is important for various heuristics. From the
transliteration class it's possible to test things like "do we know the
transliteration of this symbol?", "is it an alphanumeric character?",
"concerning dimension and vertical alignment, could it be an alphanumeric character?", and others.
There are two moments at which the transliteration class is computed. The
first is when a transliteration is added to the symbol, and the second
is when the CHAR class is propagated.
The first uses the following criteria to compute the transliteration
class:
1. If the symbol has no transliteration at all, its class is UNDEF.
2. In all other cases, the transliteration with the largest preference will
be classified as DOT, COMMA, NOISE, ACCENT and others. This search is
implemented by the classify_tr function in a straightforward way.
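A simplified sketch of these criteria follows; the function name and the mapping below are illustrative, not the actual classify_tr code (the real function tests many more cases).

    #include <string.h>

    /* hypothetical class codes, ordered as in the text */
    enum { UNDEF, DOT, COMMA, NOISE, ACCENT, OTHER };

    /* classify the transliteration with the largest preference (criterion 2),
       or return UNDEF when there is none (criterion 1) */
    int classify_sketch(const char *best_tr)
    {
        if (best_tr == NULL)
            return UNDEF;
        if (strcmp(best_tr, ".") == 0)
            return DOT;
        if (strcmp(best_tr, ",") == 0)
            return COMMA;
        /* ... remaining cases tested one by one, as in classify_tr ... */
        return OTHER;
    }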
3.1 Skeleton pixels
The first method implemented by Clara OCR for symbol classification was
skeleton fitting. Two symbols are considered similar when each one contains the skeleton of the other.
Clara OCR implements six heuristics to compute skeletons, numbered 0 to 5
and described below. The heuristic to be used is selected through the
command-line option -k as the SA parameter.
Heuristics 0, 1 and 2 consider a pixel as being a skeleton pixel if it
is the center of a circle inscribed within the closure and tangent to
the pattern boundary in more than one point.
The discrete implementation of this idea is as follows: for each pixel
p of the closure, compute the minimum distance d from p to some boundary pixel. Now try to find two pixels on the closure boundary such that
the distance from each of them to p does not differ too much from d
(the difference must be less than or equal to RR). These pixels are called "BPs".
To make the algorithm faster, the maximum distance from p to the boundary pixels considered is RX. In fact, if there exists a square of size
2*BT+1 centered at p and entirely contained in the closure, then p is considered a skeleton pixel.
As this criterion alone produces fat skeletons and isolated skeleton
pixels along the closure boundary, two other conditions are imposed:
the angular distance between the radii from p to each of those two
pixels must be "sufficiently large" (larger than MA), and a small path
joining these two boundary pixels (built only with boundary pixels)
must not exist (the "joined" function heuristically computes the smallest boundary path between the two pixels, and that distance is then
compared to MP).
Heuristics 1 and 2 are variants of heuristic 0:
1. (SA = 1) The minimum linear distance between the two BPs is specified as a factor (ML) of the square of the radius. This avoids the
conversion from rectangular to polar coordinates and may save some CPU
time, but the results will be slightly different.
2. (SA = 2) No minimum distance checks are performed, but a minimum of
MB BPs is required to exist in order to consider the pixel p a skeleton
pixel.
Heuristic 3 is very simple. It computes the skeleton by removing the
boundary BT times.
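A sketch of one erosion step in the spirit of heuristic 3 follows; the one-byte-per-pixel bitmap layout and the 4-neighbour rule are assumptions made for clarity, not the actual implementation.

    #include <string.h>

    /* a pixel survives one erosion step only if it and its four
       neighbours are black; repeating this BT times peels BT boundary
       layers and leaves an approximate skeleton */
    void erode_once(const unsigned char *bmp, unsigned char *out, int w, int h)
    {
        int x, y;

        memset(out, 0, (size_t) w * h);
        for (y = 1; y < h - 1; ++y)
            for (x = 1; x < w - 1; ++x)
                if (bmp[y*w + x] &&
                    bmp[y*w + x - 1] && bmp[y*w + x + 1] &&
                    bmp[(y-1)*w + x] && bmp[(y+1)*w + x])
                    out[y*w + x] = 1;
    }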
Heuristic 4 uses "growing lines". For angles varying in steps of approximately 22 degrees, a line of length RX pixels is drawn from each pixel. The heuristic checks whether the line can be entirely drawn using black pixels. Depending on the results, it decides whether the pixel is a skeleton pixel. For instance, if all lines could be drawn, then the pixel is the center of an inscribed circle, so it's considered a skeleton pixel. All the cases considered can be found in the source code.
Heuristic 5 computes the distance from each pixel to the border,
for some definition of distance. When that distance is at least RX, the pixel
is considered a skeleton pixel. Otherwise, it is considered a
skeleton pixel if its distance to the border is close to the maximum
distance around it (see the code for details).
Pairing applies to letters and digits. We say that the symbols a and b
(in this order) are paired if the symbol b follows the symbol a within
the same word. For instance, "h" and "a" are paired in the word "that",
"3" and "4" are paired in "12345", but "o" and "b" are not paired in
"to be" (because they're not in the same word).
The function s_pair tests symbol pairing and returns the following
diagnostics:
0 .. the symbols are paired
1 .. insufficient vertical intersection
2 .. one or both symbols above ascent
3 .. one or both symbols below descent
4 .. maximum horizontal distance exceeded
5 .. incomplete data
6 .. different zones
If p is nonzero, the inferred alignment for each symbol (a and b) is
stored in the va field of these symbols, except when these symbols
already have the va field defined.
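For illustration, a caller could map these diagnostics to messages as below; the prototype s_pair(a, b, p) is assumed from the description and may not match the actual declaration.

    #include <stdio.h>

    extern int s_pair(int a, int b, int p);   /* assumed prototype */

    static const char *pair_msg[] = {
        "paired",
        "insufficient vertical intersection",
        "one or both symbols above ascent",
        "one or both symbols below descent",
        "maximum horizontal distance exceeded",
        "incomplete data",
        "different zones",
    };

    void report_pairing(int a, int b)
    {
        int d = s_pair(a, b, 0);              /* p == 0: do not store alignment */

        printf("symbols %d and %d: %s\n", a, b, pair_msg[d]);
    }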
ArrayThe "build" OCR step, implemented by the "build" function, distributes
the symbols on words (analysing the distance, heights and relative
position for each pair of symbols), and the words on lines (analysing
the distance, heights and relative position for each pair of words).
Various important heuristics take effect here.
0. Preparation
The first step of build prepares the distribution of the symbols into
words. This is achieved by:
a. Undefining the next-symbol ("E" field) and previous-symbol ("W"
field) links for each symbol, the surrounding word ("sw" field) of each
symbol, and the next signal ("sl" field) for each symbol.
Remark: The next-symbol and previous-symbol links are used to build the
list of symbols of each word. For instance, in the word "goal", "o" is
the next of "g" and the previous of "a", "g" has no previous and "l"
has no next.
b. Undefining the transliteration class of SCHARs and the uncertain
alignment information.
2. Distributing symbols on words
The second step is, for each CHAR not yet in any word, to define a unitary
word around it and extend it to the right and to the left by applying the
symbol pairing test.
3. Computing the alignment using the words
Some symbols do not have a well-defined alignment by themselves. For
instance, a dot may be baseline-aligned (a final dot) or 0-aligned (the
"i" dot). So when computing their alignments, we need to analyse their
neighborhoods. This is performed in this step.
4. Validating the recognition
Shape-based recognitions must be validated by special heuristics. For
instance, the left column of a broken "u" may be recognized as the body
of an "i" letter. A validation heuristic may refuse this recognition,
for instance because the dot was not found. These heuristics are per-alphabet.
5. Creating fake words for punctuation signs
To produce clean output, symbols that do not belong to any word are
not included in the OCR output. So we need to create fake words for
punctuation signs like commas or final dots.
6. Aligning words
Words need to be aligned in order to detect the page text lines. This
is performed as follows:
a. Undefine the next-word and previous-word links for each word. These
are links to the previous and next word within lines. For instance, in
the line "our goal is", "goal" is the next of "our" and the previous
for "is", "our" has no previous and "is" has no next.
b. Distribution of the words on lines. This is just a matter of computing, for each word, its "next" word. So for each pair of words, we
test if they're "paired" in the sense of the function w_pair. If so, we make the left word point to the right word as its
"next" and the right word point to the left one as its "previous".
The function w_pair does not test for the existence of intermediary words.
So on the line "our goal is" that function will report pairing between
"our" and "is". So after detecting a pairing, our loop also checks whether
the detected pairing is "better" than those already detected (a sketch of
this loop is given after the list of word properties below).
c. Sort the lines. The lines are sorted based on the comparison performed by the function "cmpln".
7. Computing word properties
Finally, word properties can be computed once we have detected the
words. Some of these properties are applied to untransliterated symbols. The properties are:
1. The baseline left and right ordinates.
2. The italic and bold flags.
3. The alphabet.
4. The word bounding box.
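Returning to step 6b, here is a sketch of the loop that keeps only the best "next" candidate for each word. The name w_pair comes from the text above; w_dist, set_next and the use of horizontal distance as the "better" criterion are assumptions for this example.

    /* assumed helpers, for illustration only */
    extern int  w_pair(int a, int b);     /* nonzero when a and b are paired */
    extern int  w_dist(int a, int b);     /* horizontal distance between words */
    extern void set_next(int a, int b);   /* also sets the previous link of b */

    /* for each word keep only the best "next" candidate, so that on the
       line "our goal is" the next of "our" becomes "goal", not "is" */
    void link_words_sketch(int nwords)
    {
        int a, b, best;

        for (a = 0; a < nwords; ++a) {
            best = -1;
            for (b = 0; b < nwords; ++b) {
                if (a == b || !w_pair(a, b))
                    continue;
                if (best < 0 || w_dist(a, b) < w_dist(a, best))
                    best = b;
            }
            if (best >= 0)
                set_next(a, best);
        }
    }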
3.5 Synchronization
3.6 The function list_cl
The function list_cl lists all closures that intersect the rectangle of
height h and width w with top left corner (x,y). The result will be available
in the global array list_cl_r, at indexes 0..list_cl_sz-1. This service
is used to clip the closures or symbols (see list_s) currently visible
on the PAGE window. It's also used by OCR operations that require
locating the neighbours of one closure or symbol (see list_s).
The parameter reset must be zero on all calls, except on the very first
call of this function after loading a page.
Every time a new page is loaded, this service must be called with a
nonzero value for the reset parameter. In this case, the other parameters (x, y, w and h) are ignored, and the effect is to prepare the
page-specific static data structures used to speed up the operation.
Closures are located by list_cl from the static lists of closures clx
and cly, ordered by leftmost and topmost coordinates. Small and large
closures are handled separately. The number of closures with width
larger than FS is counted in nlx. The number of closures with height
larger than FS is counted in nly.
The clx array is divided into two parts. The left one contains
(topcl+1)-nlx indexes for the closures with width not larger than FS,
sorted by the leftmost coordinate. The right one contains the other
indexes, in descending order.
The cly array is divided into two parts. The left one contains
(topcl+1)-nly indexes for the closures with height not larger than FS,
sorted by the topmost coordinate. The right one contains the other
indexes, in descending order.
4.1 Main characteristics
1. The Clara OCR GUI uses only 5 colors: white, gray, darkgray, verydarkgray and black. The RGB value for each one is customizable at startup
(-c command-line option). On truecolor displays, graymaps are displayed
using more graylevels than the 5 listed above.
2. The X I/O is not buffered. Buffered X I/O is implemented but it's
not being used.
3. Only one X font is used for all needs (button labels, menu entries,
HTML rendering, and messages).
4. Asynchronous refresh. The OCR operations just set the redraw flags
(redraw_button, redraw_wnd, redraw_int, etc.) and let the redraw() function do its work.
The current window is identified by the CDW global variable (set by
the setview function). The variable CDW is an index into the dw array of
dwdesc structs. Some macros are used to refer to the fields of the structure dw[CDW]. A list of all of them can be found in the headers under
the title "Parameters of the current window".
Bitmaps in HTML windows and in the PAGE window are displayed in
"reduced" fashion (an RFxRF square of pixels from the bitmap is mapped
to one display pixel). If RF=1, then each bitmap pixel maps to one
display pixel.
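A sketch of that reduction follows; rendering the display pixel as black when any bitmap pixel in the RFxRF square is black is an assumption (on truecolor displays Clara uses graylevels instead), and the one-byte-per-pixel bitmap layout is chosen for clarity.

    /* value of the display pixel (dx,dy) for reduction factor RF */
    int reduced_pixel(const unsigned char *bmp, int w, int h,
                      int RF, int dx, int dy)
    {
        int x, y;

        for (y = dy * RF; y < (dy + 1) * RF && y < h; ++y)
            for (x = dx * RF; x < (dx + 1) * RF && x < w; ++x)
                if (bmp[y * w + x])
                    return 1;    /* at least one black bitmap pixel */
        return 0;
    }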
Clara is able to read a piece of HTML code, render it, display the rendered code, and handle events like the selection of an anchor, the filling of a
text field, or the submission of a form. Note that anchor selection and form
submission activate internal procedures, and won't call external customizable CGI programs.
Most windows displayed by Clara are produced using this HTML support.
When the "show HTML source" option on the "View" menu is selected,
Clara will display unrendered HTML, and it will become easier to identify the HTML windows. Note that all HTML is produced by Clara itself.
Clara won't read HTML from files or through HTTP.
Perhaps you are asking why Clara implements these things. Clara does
not intend to be a web browser. Clara supports HTML because we needed
a forms interface, and HTML forms were readily available and extensively
proven in practice as an easy and effective solution. Note that we're not
trying to achieve completeness. Clara's HTML support is partial. There is
only one font available, tables cannot be nested, most options are
unavailable, PBM is the only graphic format supported, etc. However, it
meets our needs, and the code is surprisingly small.
Let's understand how the HTML windows work. First of all, note that
there is an html flag on the structure that defines a window (structure
dwdesc). For instance, this flag is on for the window OUTPUT (initialization code in the function setview).
When the function redraw is called and the window OUTPUT is visible on
the plate, the service draw_dw will be called, with OUTPUT identified through
the global variable CDW (Current Window). However, before doing that,
redraw will test the flag RG to check whether the HTML contents of the OUTPUT window must be generated again, by calling a function specific to that
window. For instance, when a symbol is trained, this flag must be set
in order to notify asynchronously the need to recompute the window contents and render it again.
The rendering of each element on the HTML page creates one graphic element ("GE" for short).
Free text is rendered to one GE of type GE_TEXT per word. This is a
"feature". The rendering procedures are currently unable to put more
than one text word per GE.
IMG tags are rendered to one GE of type GE_IMG. Note that the value of
the SRC attribute cannot be the name of a file containing an image, but
must be "internal" or "pattern/n". These are keywords for the web clip
and for the bitmap of the pattern "n". The value of the SRC attribute is
stored in the "txt" field of the GE.
INPUT tags with TYPE=TEXT are rendered to one GE of type GE_INPUT. The
predefined value of the field (attribute VALUE) is stored in the field
"txt" of the GE. The name of the field (attribute NAME) is stored in
the field "arg" of the GE.
The Clara OCR HTML support adds INPUT tags with TYPE=NUMBER. They're
rendered like TYPE=TEXT, but two steppers are added for faster selection. So such tags will create three GEs (left stepper, input field,
and right stepper).
INPUT tags with TYPE=CHECKBOX are rendered to one GE of type GE_CBOX.
The variable name (attribute NAME) is stored in the "arg" field. The
argument to VALUE is stored in the field "txt". The status of the
checkbox is stored in the "iarg" field (2 means "checked", 0 means "not
checked").
INPUT tags with TYPE=RADIO are rendered just like CHECKBOX. The only
difference is the type GE_RADIO instead of GE_CBOX.
SELECT tags (starting a SELECT element) are rendered to one GE of type
SELECT. In fact, the entire SELECT element is stored in only one GE.
Each SELECT element originates one standard context menu, as implemented by the Clara GUI. The "iarg" field stores the menu index. The
free text in each OPTION element is stored as an item label on the
context menu. The implementation of the SELECT element is currently
buggy: (a) for each rendering, one entry in the array of context
menus is allocated and never freed, and (b) the NAME attribute
of the SELECT isn't stored anywhere.
INPUT tags with TYPE=SUBMIT are rendered to one GE of type GE_SUBMIT.
The value of the attribute VALUE is stored in the "txt" field. The
value of the ACTION attribute is stored in the field "arg". The field
"a" will store HTA_SUBMIT.
TD tags are rendered to one GE of type GE_RECT. The value of the
BGCOLOR attribute is stored in the "bg" field as a code (only the colors known by the Clara GUI are supported: WHITE, BLACK, GRAY, DARKGRAY
and VDGRAY). The coordinates of the cell within the table are stored
in the fields "tr" and "tc".
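Putting the pieces together, a graphic element could be sketched as below. The field names txt, arg, iarg, tr, tc and bg come from the text, but the exact types, the GE type constants and their grouping in one struct are assumptions made for illustration.

    /* illustrative graphic element; not the real Clara OCR declaration */
    typedef struct {
        int   type;     /* GE_TEXT, GE_IMG, GE_INPUT, GE_CBOX, GE_RADIO,
                           GE_SUBMIT, GE_RECT, ... */
        char *txt;      /* word, SRC value, VALUE attribute or item text */
        char *arg;      /* NAME attribute or ACTION attribute */
        int   iarg;     /* checkbox status or context menu index */
        int   tr, tc;   /* table row and column (GE_RECT) */
        int   bg;       /* background color code (GE_RECT) */
    } ge_sketch;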
The Clara OCR GUI tries to apply immediately all actions taken by the
user. So the HTML forms (e.g. the PATTERN window) do not contain SUBMIT
buttons, because they're not required (some forms contain a SUBMIT button disguised as a CONSIST facility, but this is just for the user's
convenience).
The editable input fields make auto-submission mechanisms a bit harder,
because we cannot apply consistency tests and process the form before
the user finishes filling the field, so auto-submission must be triggered on selected events. The triggers must be a bit smart, because
some events must be handled before submission (for instance toggling a
CHECKBOX), while others must be handled after submission (for instance
changing the current tab). So auto-submission must be carefully studied. The current strategy follows:
a. When the window PAGE (symbol) or the window PATTERN is visible,
auto-submit just after attending the buttons that change the current
symbol/pattern data (buttons BOLD, ITALIC, ALPHABET or PTYPE).
b. When the window PAGE (symbol) or the window PATTERN is visible,
auto-submit just before attending the left or right arrows.
c. When the user presses ENTER and an active input field exists, auto-submit.
d. Auto-submit as the first action taken by the setview service, in
order to flush the current form before changing the current tab or tab
mode.
e. Auto-submit just after opening any menu, in order to flush data
before some critical action like quitting the program or starting some
OCR step.
f. Auto-submit just after attending CHECKBOX or RADIO buttons.
5.5 The function show_hint
Messages are put on the status line (at the bottom of the application's X
window) using the show_hint service. The show_hint service receives two
parameters: an integer f and a string (the message).
If f is 0, then the message is "discardable". It won't be displayed if
a permanent message is currently being displayed.
If f is 1, then the message is "fixed". It won't be erased by a subsequent show_hint call passing the empty string as the message (in practical terms, pointer motion won't clear the message).
If f is 2, then the message is "permanent" (the message will be cleared
only by another fixed or permanent message).
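A usage sketch follows; the prototype below is assumed from the description (an integer f and a message string) and may differ from the actual declaration (which might, for instance, be printf-like).

    extern void show_hint(int f, char *msg);   /* assumed prototype */

    void hint_examples(void)
    {
        show_hint(0, "discardable: hidden while a permanent message is shown");
        show_hint(1, "fixed: survives the empty-string calls from pointer motion");
        show_hint(2, "permanent: cleared only by another fixed or permanent message");
    }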
The start_ocr service starts a complete OCR run or an individual OCR step
on one given page, or on all pages. For instance, start_ocr is called by the GUI
when the user presses the "OCR" button or when the user requests the loading of one specific page.
In fact, almost every user-requested operation is performed as an "OCR
step" in order to take advantage of the execution model implemented by
the function continue_ocr. So start_ocr is the starting point for
attending almost all user requests.
If p is -1, all pages are processed; if p < -1, only the current page
(cpage) is processed; otherwise only page p is processed. If s >= 0, only
step s is run; otherwise all steps are run.
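For illustration, assuming a prototype of the form start_ocr(p, s) (the real function may take additional arguments), the conventions above translate to calls like:

    extern void start_ocr(int p, int s);   /* assumed prototype */

    void start_examples(void)
    {
        start_ocr(-1, -1);    /* all pages, all steps */
        start_ocr(-2, -1);    /* current page (cpage) only, all steps */
        start_ocr(3, 5);      /* page 3, step 5 only */
    }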
6.1 How to add a bitmap comparison method
It's not hard to add a bitmap comparison method to Clara OCR. This may
become very important when the available heuristics are unable to classify the symbols of some book, so a new heuristic must be created. In
order to exemplify that, we'll add a naive bitmap comparison method.
It'll just compare the number of black pixels in each bitmap, and consider that the bitmaps are similar when these numbers do not differ too
much.
Please remember that the code added or linked to Clara OCR must be GPL.
In order to add the new bitmap comparison method, we need to write a
function that compares two bitmaps and returns how similar they are, add
this function as an alternative to the Options menu, and call it when
classifying the page symbols. We'll perform all these steps adding a
naive comparison method, step by step. The most difficult one is to
write the bitmap comparison method. This step is covered in the subsection "How to write a bitmap comparison function".
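A sketch of the naive comparison function follows. The one-byte-per-pixel bitmap layout and the 10% similarity threshold are assumptions made for this example; the real code would adapt them to Clara OCR's internal bitmap representation.

    /* return nonzero when the two w x h bitmaps have similar numbers of
       black pixels (counts differing by at most 10% of the larger one) */
    int naive_cmp(const unsigned char *a, const unsigned char *b, int w, int h)
    {
        int i, n = w * h, na = 0, nb = 0, d, m;

        for (i = 0; i < n; ++i) {
            na += (a[i] != 0);
            nb += (b[i] != 0);
        }
        d = (na > nb) ? na - nb : nb - na;
        m = (na > nb) ? na : nb;
        return (m == 0) ? 1 : (d * 10 <= m);
    }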
Let's present the other two steps. To add the new method to the Options
menu, we need:
These are the steps to add a new button:
(Some) Major tasks
1. Vertical segmentation (partially done).
2. Heuristics to merge fragments.
3. Spelling-generated transliterations.
4. Geometric detection of lines and words.
5. Finish the documentation.
6. Simplify the revision acts subsystem.
Minor tasks
1. Change sprintf to snprintf.
2. Fix the asymmetric behaviour of the function "joined".
3. Optimize bitmap copies to copy words, not bits, where possible (partially done).
4. Support Multiple OCR zones (partially done).
5. Make sure that the access to the data structures is blocked during
OCR (all functions that change the data structures must check the value
of the flag "ocring").
6. Use 64-bit integers for bitmap comparisons and support big-endian
CPUs (partially done).
7. Clear memory buffers before freeing.
8. Allow the transliterations to refer to multiple acts (partially done).
9. Rewrite composition of patterns for classification of linked symbols.
10. The flea stops but does not disappear when the window loses and regains
focus.
11. Replace various magic numbers with per-density and per-minimum-font-size values.
ArrayClara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati wrote
the internal preprocessor. Clara OCR includes bugfixes produced by
other developers. The Changelog (http://www.claraocr.org/CHANGELOG)
acknowledges all of them (see below). Imre Simon contributed high-volume
tests, discussions with experts, selection of bibliographic resources,
propaganda and many ideas on how to make the software more useful.
Ricardo authored various free materials, some included (at least) in
Conectiva, Debian, FreeBSD and SuSE (the verb conjugator "conjugue",
the ispell dictionary br.ispell and the proxy axw3). He recently ported
the EiC interpreter to the Psion 5 handheld and patched the Xt-based
vncviewer to scale framebuffers and compute image diffs. Ricardo works
as an independent developer and instructor. He received no financial
aid to develop Clara OCR. He's not an employee of any company or organization.
Imre Simon promotes the usage and development of free technologies and
information from his research, teaching and administrative labour at
the University.
Roberto Hirata Junior and Marcelo Marcilio Silva contributed ideas on
character isolation and recognition. Richard Stallman suggested
improvements on how to generate HTML output. Marius Vollmer is helping
to add Guile support. Jacques Le Marois helped with the announcement process.
We acknowledge Mike O'Donnell and Junior Barrera for their good criticism. We acknowledge Peter Lyman for his remarks about the Berkeley
Digital Library, and Wanderley Antonio Cavassin, Janos Simon and
Roberto Marcondes Cesar Junior for some web and bibliographic pointers.
Bruno Barbieri Gnecco provided hints and explanations about GOCR (main
author: Jorg Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is
kindly supporting our attempts to use portions of his code. Adriano
Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried the
tutorial before the first announcement. Eduardo Marcel Macan packaged Clara
OCR for Debian and suggested some improvements. Mandrakesoft is hosting
claraocr.org. We acknowledge Conectiva and SuSE for providing copies of
their outstanding distributions. Finally, we acknowledge the late Jose
Hugo de Oliveira Bussab for his interest in our work.
Adriano Nagelschmidt Rodrigues donated a 15" monitor.
The fonts used by the "view alphabet map" feature came from Roman Czyborra's "The ISO 8859 Alphabet Soup" page at http://czyborra.com/charsets/iso8859.html.
The names cited by the CHANGELOG (and not cited before) follow (small
patches, bug reports, specfiles, suggestions, explanations, etc).
- Brian G. (win32), Bruce Momjian, Charles Davant (server admin), Daniel Merigoux, De Clarke, Emile Snider (preprocessor, to be released), Erich Mueller, Franz Bakan (OS/2), groggy, Harold van Oostrom, Ho Chak Hung, Jeroen Ruigrok, Laurent-jan, Nathalie Vielmas, Romeu Mantovani Jr (packager), Ron Young, R P Herrold, Sergei Andrievskii, Stuart Yeates, Terran Melconian, Thomas Klausner (NetBSD), Tim McNerney, Tyler Akins.