approx(3)

NAME

String::Approx - Perl extension for approximate matching
(fuzzy matching)

SYNOPSIS

use String::Approx 'amatch';
print if amatch("foobar");
my @matches = amatch("xyzzy", @inputs);
my @catches = amatch("plugh", ['2'], @inputs);

DESCRIPTION

String::Approx lets you match and substitute strings
approximately. With this you can emulate errors: typing
errorrs, speling errors, closely related vocabularies
(colour color), genetic mutations (GAG ACT), abbreviations
(McScot, MacScot).

The measure of approximateness is the Levenshtein edit distance. It is the total number of "edits": insertions,: word world
deletions,: monkey money
and substitutions: sun fun
required to transform a string to another string. For example, to transform "lead" into "gold", you need three edits:: lead gead goad gold
The edit distance of "lead" and "gold" is therefore three.

MATCH

use String::Approx 'amatch';

$matched = amatch("pattern")
$matched = amatch("pattern", [ modifiers ])

$any_matched = amatch("pattern", @inputs)
$any_matched = amatch("pattern", [ modifiers ],
@inputs)
@match = amatch("pattern") @match = amatch("pattern", [ modifiers ])
@matches = amatch("pattern", @inputs) @matches = amatch("pattern", [ modifiers ],
@inputs)
Match pattern approximately. In list context return the matched @inputs. If no inputs are given, match against
the $_. In scalar context return true if any of the
inputs match, false if none match.
Notice that the pattern is a string. Not a regular expression. None of the regular expression notations (^, ., *, and so on) work. They are characters just like the others. Note-on-note: some limited form of "regular expressionism" is planned in future: for example character classes ([abc]) and any-chars (.). But that feature will be turned on by a special modifier (just a guess: "r"), so there should be no backward compatibility problem.
Notice also that matching is not symmetric. The inputs are matched against the pattern, not the other way round. In other words: the pattern can be a substring, a sub match, of an input element. An input element is always a superstring of the pattern.
MODIFIERS
With the modifiers you can control the amount of approxi mateness and certain other control variables. The modi fiers are one or more strings, for example "i", within a string optionally separated by whitespace. The modifiers are inside an anonymous array: the "[ ]" in the syntax are not notational, they really do mean "[ ]", for example "[ "i", "2" ]". "["2 i"]" would be identical.
The implicit default approximateness is 10%, rounded up. In other words: every tenth character in the pattern may be an error, an edit. You can explicitly set the maximum approximateness by supplying a modifier like: number
number%
Examples: "3", "15%".
Using a similar syntax you can separately control the max imum number of insertions, deletions, and substitutions by prefixing the numbers with I, D, or S, like this:: Inumber
Inumber%
Dnumber
Dnumber%
Snumber
Snumber%
Examples: "I2", "D20%", "S0".
You can ignore case ("A" becames equal to "a" and vice versa) by adding the "i" modifier.
For example: [ "i 25%", "S0" ]
means ignore case, allow every fourth character to be "an edit", but allow no substitutions. (See NOTES about dis allowing substitutions or insertions.)

SUBSTITUTE

use String::Approx 'asubstitute';

@substituted = asubstitute("pattern", "replace
ment") @substituted = asubstitute("pattern", "replace
ment", @inputs) @substituted = asubstitute("pattern", "replace
ment", [ modifiers ]) @substituted = asubstitute("pattern", "replace
ment",: [ modifiers ], @inputs)
Substitute approximate pattern with replacement and return as a list <copies> of @inputs, the substitutions having
been made on the elements that did match the pattern. If
no inputs are given, substitute in the $_. The replace
ment can contain magic strings $&, $`, $' that stand for
the matched string, the string before it, and the string
after it, respectively. All the other arguments are as in
"amatch()", plus one additional modifier, "g" which means
substitute globally (all the matches in an element and not
just the first one, as is the default).
See "BAD NEWS" about the unfortunate stinginess of "asub stitute()".

INDEX

use String::Approx 'aindex';

$index = aindex("pattern")
@indices = aindex("pattern", @inputs)
$index = aindex("pattern", [ modifiers ])
@indices = aindex("pattern", [ modifiers ], @in
puts)
Like "amatch()" but returns the index/indices at which the pattern matches approximately. In list context and if @inputs are used, returns a list of indices, one index for each input element. If there's no approximate match, "-1" is returned as the index.
There's also backwards-scanning "arindex()".

SLICE

use String::Approx 'aindex';

($index, $size) = aslice("pattern")
([$i0, $s0], ...) = aslice("pattern", @inputs)
($index, $size) = aslice("pattern", [ modifiers
]) ([$i0, $s0], ...) = aslice("pattern", [ modifiers
], @inputs)
Like "aindex()" but returns also the size of the match. If the match fails, returns an empty list (when matching against $_) or an empty anonymous list corresponding to the particular input.
Note that the size of the match will very probably be something you did not expect (such as longer than the pat tern). This may or may not be fixed in future releases.
If the modifier: "minimal_distance"
is used, the minimal possible edit distance is returned as the third element:: ($index, $size, $distance) = aslice("pattern", [; modifiers ])
([$i0, $s0, $d0], ...) = aslice("pattern", [; modifiers ], @inputs)

DISTANCE

use String::Approx 'adist';

$dist = adist("pattern", $input);
@dist = adist("pattern", @input);
Return the edit distance or distances between the pattern and the input or inputs. Zero edit distance means exact match. (Remember that the match can 'float' in the inputs, the match is a substring match.) If the pattern is longer than the input or inputs, the returned distance or distance is or are negative.: use String::Approx 'adistr';; $dist = adistr("pattern", $input);
@dist = adistr("pattern", @inputs);
Return the relative edit distance or distances between the pattern and the input or inputs. Zero relative edit dis
tance means exact match, one means completely different.
(Remember that the match can 'float' in the inputs, the
match is a substring match.) If the pattern is longer
than the input or inputs, the returned distance or dis
tances is or are negative.
You can use adist() or adistr() to sort the inputs accord ing to their approximateness:: my %d;
@d{@inputs} = map { abs } adistr("pattern", @in; puts);
my @d = sort { $d{$a} <=> $d{$b} } @inputs;
Now @d contains the inputs, the most like "pattern" first.

CONTROLLING THE CACHE

"String::Approx" maintains a LU (least-used) cache that
holds the 'matching engines' for each instance of a pat_
tern+modifiers. The cache is intended to help the case where you match a small set of patterns against a large
set of string. However, the more engines you cache the
more you eat memory. If you have a lot of different pat
terns or if you have a lot of memory to burn, you may want
to control the cache yourself. For example, allowing a
larger cache consumes more memory but probably runs a lit
tle bit faster since the cache fills (and needs flushing)
less often.

The cache has two parameters: max and purge. The first one is the maximum size of the cache and the second one is the cache flushing ratio: when the number of cache entries exceeds max, max times purge cache entries are flushed. The default values are 1000 and 0.75, respectively, which means that when the 1001st entry would be cached, 750 least used entries will be removed from the cache. To access the parameters you can use the calls: $now_max = String::Approx::cache_max();
String::Approx::cache_max($new_max);; $now_purge = String::Approx::cache_purge();
String::Approx::cache_purge($new_purge);; $limit = String::Approx::cache_n_purge();
To be honest, there are actually two caches: the first one
is used far the patterns with no modifiers, the second one
for the patterns with pattern modifiers. Using the stan
dard parameters you will therefore actually cache up to
2000 entries. The above calls control both caches for the
same price.
To disable caching completely use: String::Approx::cache_disable();
Note that this doesn't flush any possibly existing cache entries, to do that use: String::Approx::cache_flush_all();

NOTES

Because matching is by substrings, not by whole strings, insertions and substitutions produce often very similar
results: "abcde" matches "axbcde" either by insertion or
substitution of "x".

The maximum edit distance is also the maximum number of edits. That is, the "I2" in: amatch("abcd", ["I2"])
is useless because the maximum edit distance is (implic itly) 1. You may have meant to say: amatch("abcd", ["2D1S1"])
or something like that.
If you want to simulate transposes: feet fete
you need to allow at least edit distance of two because in terms of our edit primitives a transpose is first one deletion and then one insertion.
TEXT POSITION
The starting and ending positions of matching, substitut ing, indexing, or slicing can be changed from the begin ning and end of the input(s) to some other positions by using either or both of the modifiers: "initial_position=24"
"final_position=42"
or the both the modifiers: "initial_position=24"
"position_range=10"
By setting the "position_range" to be zero you can limit (anchor) the operation to happen only once (if a match is possible) at the position.

VERSION

Major release 3.

CHANGES FROM VERSION 2

GOOD NEWS

The version 3 is 2-3 times faster than version 2 No pattern length limitation: The algorithm is independent on the pattern length:
its time complexity is O(kn), where k is the number of edits and n the length of the text (input). The pre
processing of the pattern will of course take some
O(m) (m being the pattern length) time, but "amatch()"
and "asubstitute()" cache the result of this prepro
cessing so that it is done only once per pattern.
BAD NEWS
You do need a C compiler to install the module: Perl's regular expressions are no more used; instead a
faster and more scalable algorithm written in C is
used.
"asubstitute()" is now always stingy: The string matched and substituted is now always
stingy, as short as possible. It used to be as long
as possible. This is an unfortunate change stemming
from switching the matching algorithm. Example: with
edit distance of two and substituting for "word" from
"cork" and "wool" previously did match "cork" and
"wool". Now it does match "or" and "wo". As little
as possible, or, in other words, with as much approxi
mateness, as many edits, as possible. Because there
is no need to match the "c" of "cork", it is not
matched.
no more "aregex()" because regular expressions are no more used no more "compat1" for String::Approx version 1 compatibil ity

ACKNOWLEDGEMENTS

The following people have provided valuable test cases,
documentation clarifications, and other feedback:

Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A.
Chervitz, Aldo Calpini, David Curiel, Teun van den Dool,
Alberto Fontaneda, Rob Fugina, Dmitrij Frishman, Lars
Gregersen, Kevin Greiner, B. Elijah Griffin, Mike Hanafey,
Mitch Helle, Ricky Houghton, Helmut Jarausch, Damian
Keefe, Ben Kennedy, Craig Kelley, Franz Kirsch, Dag Kris
tian, Mark Land, J. D. Laub, Juha Muilu, Sergey Novoselov,
Andy Oram, Eric Promislow, Nikolaus Rath, Stefan Ram, Dag
Kristian Rognlien, Stewart Russell, Slaven Rezic, Chris
Rosin, Pasha Sadri, Ilya Sandler, Bob J.A. Schijvenaars,
Ross Smith, Frank Tobin, Greg Ward, Rick Wise.

The matching algorithm was developed by Udi Manber, Sun
Wu, and Burra Gopal in the Department of Computer Science,
University of Arizona.

AUTHOR

Jarkko Hietaniemi <jhi@iki.fi>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

Furthermore: no warranties or obligations of any kind are
given, and the separate file COPYRIGHT must be included intact in all copies and derived materials.

docs.sk

comprehensive documentation repository

In section