Unicode::Collate(3)

NAME

Unicode::Collate - Unicode Collation Algorithm

SYNOPSIS

   use Unicode::Collate;

   # construct
   $Collator = Unicode::Collate->new(%tailoring);

   # sort
   @sorted = $Collator->sort(@not_sorted);

   # compare
   $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

DESCRIPTION

Constructor and Tailoring

The "new" method returns a collator object.

   $Collator = Unicode::Collate->new(
      alternate => $alternate,
      backwards => $levelNumber, # or \@levelNumbers
      entry => $element,
      normalization => $normalization_form,
      ignoreName => qr/$ignoreName/,
      ignoreChar => qr/$ignoreChar/,
      katakana_before_hiragana => $bool,
      level => $collationLevel,
      overrideCJK => \&overrideCJK,
      overrideHangul => \&overrideHangul,
      preprocess => \&preprocess,
      rearrange => \@charList,
      table => $filename,
      undefName => qr/$undefName/,
      undefChar => qr/$undefChar/,
      upper_before_lower => $bool,
   );
   # if %tailoring is false (i.e. empty),
   # $Collator should do the default collation.

alternate
-- see 3.2.2 Alternate Weighting, UTR #10.
This key selects the alternate weighting for variable
collation elements, which are marked with an ASTERISK
in the table (NOTE: many punctuation marks and symbols
are variable in allkeys.txt).

   alternate => 'blanked', 'non-ignorable', 'shifted',
   or 'shift-trimmed'.

These names are case-insensitive. By default (if the
specification is omitted), 'shifted' is adopted.

   'Blanked'        Variable elements are ignorable at
                    levels 1 through 3;
                    considered at the 4th level.

   'Non-ignorable'  Variable elements are not reset to
                    ignorable.

   'Shifted'        Variable elements are ignorable at
                    levels 1 through 3;
                    their level 4 weight is replaced
                    by the old level 1 weight.
                    Level 4 weight for Non-Variable
                    elements is 0xFFFF.

   'Shift-Trimmed'  Same as 'shifted', but all FFFF's
                    at the 4th level are trimmed.

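
As a rough illustration of two of these settings, the following sketch compares a hyphenated word with its unhyphenated form. It assumes that U+002D HYPHEN-MINUS is a variable element in the table in use (as it is in allkeys.txt), and it limits the comparison to level 3 so that the 4th-level tie-break does not come into play. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

# 'shifted' (the default): the hyphen is ignorable at levels 1 through 3.
my $shifted = Unicode::Collate->new(alternate => 'shifted', level => 3);

# 'non-ignorable': the hyphen keeps its ordinary weights.
my $non_ignorable = Unicode::Collate->new(alternate => 'non-ignorable', level => 3);

my $shifted_equal       = $shifted->eq("de-luge", "deluge");
my $non_ignorable_equal = $non_ignorable->eq("de-luge", "deluge");

print "shifted:       ", ($shifted_equal       ? "equal" : "different"), "\n";
print "non-ignorable: ", ($non_ignorable_equal ? "equal" : "different"), "\n";
```

With the default maximum level of 4, the 'shifted' collator would still tell the two strings apart through the level 4 weights.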
backwards
-- see 3.1.2 French Accents, UTR #10.

   backwards => $levelNumber or \@levelNumbers

Weights at the given level(s) are compared in reverse
order; e.g. level 2 (diacritic ordering) in French.
If omitted, weights are compared forwards at all levels.
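
A minimal sketch of the French case follows. The two words differ only in where the accent sits, so the direction in which level 2 weights are scanned decides the order. The strings use escapes so that the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $forwards  = Unicode::Collate->new();                # default: forwards at all levels
my $backwards = Unicode::Collate->new(backwards => 2);  # level 2 compared from the end

my $cote_circumflex = "c\x{F4}te";   # "cote" with a circumflex on the o
my $cote_acute      = "cot\x{E9}";   # "cote" with an acute on the e

# Forwards, the earlier accent (on the o) makes the first word greater.
my $fwd = $forwards->cmp($cote_circumflex, $cote_acute);

# Backwards, the later accent (on the e) is seen first, so the order flips.
my $bwd = $backwards->cmp($cote_circumflex, $cote_acute);

print "forwards:  $fwd\n";
print "backwards: $bwd\n";
```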
entry
-- see 3.1 Linguistic Features; 3.2.1 File Format, UTR #10.
Overrides a default order or defines additional collation
elements.

   entry => <<'ENTRIES', # use the UCA file format
   00E6 ; [.0861.0020.0002.00E6] [.08B1.0020.0002.00E6] # ligature <ae> as <a><e>
   0063 0068 ; [.0893.0020.0002.0063] # "ch" in traditional Spanish
   0043 0068 ; [.0893.0020.0008.0043] # "Ch" in traditional Spanish
   ENTRIES
ignoreName
ignoreChar
-- see Completely Ignorable, 3.2.2 Alternate Weighting,
UTR #10.
Makes the entry in the table ignorable. If a collation
element is ignorable, it is ignored as if the
element had been deleted from there.
E.g. when 'a' and 'e' are ignorable, 'element' is
equal to 'lament' (or 'lmnt').
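
The 'element'/'lament' case might be reproduced as in the sketch below. It assumes that the pattern given to ignoreChar is matched against each table entry's character string, so that an anchored pattern affects only the single letters 'a' and 'e', and that the default table file is installed in its text form.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Make the lowercase letters 'a' and 'e' ignorable (assumption: the regex
# is applied to the character(s) of each table entry).
my $no_a_e = Unicode::Collate->new(ignoreChar => qr/^[ae]$/);

# Both strings reduce to "lmnt" once 'a' and 'e' are ignored.
my $same = $no_a_e->eq("element", "lament");

print "element vs lament: ", ($same ? "equal" : "different"), "\n";
```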
level
-- see 4.3 Form a sort key for each string, UTR #10.
Sets the maximum level. Any levels higher than the
specified one are ignored.

   Level 1: alphabetic ordering
   Level 2: diacritic ordering
   Level 3: case ordering
   Level 4: tie-breaking (e.g. in the case when alternate is 'shifted')

   ex. level => 2,

If omitted, the maximum is the 4th.
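
The effect of the maximum level can be sketched as follows: at level 1 only the base letters count, while at level 2 diacritics are compared but case is still ignored. The test strings are arbitrary.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $level1 = Unicode::Collate->new(level => 1);  # base letters only
my $level2 = Unicode::Collate->new(level => 2);  # base letters and diacritics

my $plain    = "resume";
my $upper    = "RESUME";
my $accented = "r\x{E9}sum\x{E9}";   # "resume" with acute accents on both e's

my $l1_ignores_accents = $level1->eq($plain, $accented);  # accents are level 2
my $l2_ignores_case    = $level2->eq($plain, $upper);     # case is level 3
my $l2_sees_accents    = $level2->ne($plain, $accented);

print "level 1 ignores accents: ", ($l1_ignores_accents ? "yes" : "no"), "\n";
print "level 2 ignores case:    ", ($l2_ignores_case    ? "yes" : "no"), "\n";
print "level 2 sees accents:    ", ($l2_sees_accents    ? "yes" : "no"), "\n";
```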
normalization
-- see 4.1 Normalize each input string, UTR #10.
If specified, strings are normalized before preparation
of sort keys (the normalization is executed after
preprocess).
As a form name, one of the following names must be
used.

   'C' or 'NFC' for Normalization Form C
   'D' or 'NFD' for Normalization Form D
   'KC' or 'NFKC' for Normalization Form KC
   'KD' or 'NFKD' for Normalization Form KD

If omitted, the string is put into Normalization Form
D.
If "undef" is passed explicitly as the value for this
key, no normalization is carried out (this may
make tailoring easier if normalization is not
desired).
See CAVEAT.
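
One way to see what normalization contributes is to compare two canonically equivalent strings that carry the same combining marks in a different order. In the sketch below, the default (Normalization Form D) reorders the marks before sort keys are formed; with normalization disabled, the strings are expected to differ, on the assumption that the two marks have distinct weights in the table in use.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $nfd     = Unicode::Collate->new();                        # default: NFD
my $no_norm = Unicode::Collate->new(normalization => undef);  # no normalization

# The same two combining marks (U+0307 DOT ABOVE, U+0323 DOT BELOW)
# attached to "q" in different orders.
my $s1 = "q\x{0307}\x{0323}";
my $s2 = "q\x{0323}\x{0307}";

my $equal_with_nfd   = $nfd->eq($s1, $s2);
my $equal_without_it = $no_norm->eq($s1, $s2);

print "with NFD:              ", ($equal_with_nfd   ? "equal" : "different"), "\n";
print "without normalization: ", ($equal_without_it ? "equal" : "different"), "\n";
```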
overrideCJK
-- see 7.1 Derived Collation Elements, UTR #10.
By default, mapping of CJK Unified Ideographs uses the
Unicode codepoint order. But the mapping of CJK Unified
Ideographs may be overridden.
ex. CJK Unified Ideographs in the JIS code point order.

   overrideCJK => sub {
       my $u = shift;             # get a Unicode code point
       my $b = pack('n', $u);     # to UTF-16BE
       my $s = your_unicode_to_sjis_converter($b); # convert
       my $n = unpack('n', $s);   # convert sjis to short
       [ $n, 0x20, 0x2, $u ];     # return the collation element
   },

ex. ignores all CJK Unified Ideographs.

   overrideCJK => sub {()}, # CODEREF returning empty list

   # where ->eq("Pe\x{4E00}rl", "Perl") is true
   # as U+4E00 is a CJK Unified Ideograph and to be ignorable.

If "undef" is passed explicitly as the value for this
key, weights for CJK Unified Ideographs are treated as
undefined. But assignment of weight for CJK Unified
Ideographs in table or entry is still valid.
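
The "ignore all ideographs" case can be tried out with a sketch like the following, which sets a collator with the empty-list coderef beside a default collator for contrast.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default = Unicode::Collate->new();
my $no_cjk  = Unicode::Collate->new(overrideCJK => sub { () });  # ignore ideographs

my $with_ideograph = "Pe\x{4E00}rl";   # U+4E00 is a CJK Unified Ideograph

my $ignored = $no_cjk->eq($with_ideograph, "Perl");   # ideograph contributes nothing
my $kept    = $default->ne($with_ideograph, "Perl");  # ideograph has a weight by default

print "overrideCJK => sub {()}: ", ($ignored ? "equal" : "different"), "\n";
print "default:                 ", ($kept    ? "different" : "equal"), "\n";
```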
overrideHangul
-- see 7.1 Derived Collation Elements, UTR #10.
By default, Hangul Syllables are decomposed into
Hangul Jamo. But the mapping of Hangul Syllables may
be overridden.
This tag works like overrideCJK, so see there for
examples.
If you want to override the mapping of Hangul Syllables,
the Normalization Forms D and KD are not appropriate
(the syllables will be decomposed before overriding).
If "undef" is passed explicitly as the value for this
key, the weight for Hangul Syllables is treated as
undefined without decomposition into Hangul Jamo. But
definition of weight for Hangul Syllables in table or
entry is still valid.
preprocess
-- see 5.1 Preprocessing, UTR #10.
If specified, the coderef is used to preprocess before
the formation of sort keys.

ex. dropping English articles, such as "a" or "the".
Then, "the pen" is before "a pencil".

   preprocess => sub {
       my $str = shift;
       $str =~ s/\b(?:an?|the)\s+//gi;
       $str;
   },
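
A self-contained sketch of article dropping follows. The substitution pattern is an assumption chosen for the illustration (it removes "a", "an" and "the" followed by whitespace); the list of titles is made up.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;   # drop English articles (illustrative pattern)
        return $str;
    },
);

my @titles = ("the pen", "a pencil", "an apple");

# Sort keys are built from "pen", "pencil" and "apple",
# but the original strings are what sort() returns.
my @sorted = $Collator->sort(@titles);

print "$_\n" for @sorted;
```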
rearrange
-- see 3.1.3 Rearrangement, UTR #10.
Characters that are not coded in logical order and are
to be rearranged. By default,

   rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],

If you want to disallow any rearrangement, pass
"undef" or "[]" (a reference to an empty list) as the
value for this key.
table
-- see 3.2 Default Unicode Collation Element Table,
UTR #10.
You can use another element table if desired. The
table file must be in your "lib/Unicode/Collate"
directory.
By default, the file "lib/Unicode/Collate/allkeys.txt"
is used.
If "undef" is passed explicitly as the value for this
key, no file is read (but you can define collation
elements via entry).
A typical way to define a collation element table
without any table file:

$onlyABC = Unicode::Collate->new(
table => undef,
entry => << 'ENTRIES',
0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
ENTRIES
);
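
A collator built this way might be used as in the sketch below, which repeats the constructor so that it runs on its own. Only the six letters defined in the entries have weights, so the input is restricted to them.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $onlyABC = Unicode::Collate->new(
    table => undef,            # read no table file
    entry => <<'ENTRIES',
0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
ENTRIES
);

# Primary weights order a < b < c; the level 3 weights put
# each lowercase letter before its uppercase counterpart.
my @sorted = $onlyABC->sort(qw(c B a C b A));

print "@sorted\n";
```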
undefName
undefChar
-- see 6.3.4 Reducing the Repertoire, UTR #10.
Undefines the collation element as if it were unassigned
in the table. This reduces the size of the
table. If an unassigned character appears in the
string to be collated, the sort key is made from its
codepoint as a single-character collation element, and
it is greater than any assigned collation element
(unassigned characters are ordered among themselves by
codepoint). However, it would be better to ignore characters
that are unfamiliar to you and probably never used.
katakana_before_hiragana
upper_before_lower
-- see 6.6 Case Comparisons; 7.3.1 Tertiary Weight
Table, UTR #10.
By default, lowercase is before uppercase and hiragana
is before katakana.
If the tag is made true, this is reversed.
NOTE: These tags simplemindedly assume that any
lowercase/uppercase or hiragana/katakana distinctions
occur at level 3, and that their weights at level 3
are the same as those mentioned in 7.3.1, UTR #10.
If you define collation elements which violate
this, these tags do not work properly.
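
A short sketch of upper_before_lower, using the default table and an arbitrary word list:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $default     = Unicode::Collate->new();
my $upper_first = Unicode::Collate->new(upper_before_lower => 1);

my @letters = qw(B a A b);

my @lower_first_order = $default->sort(@letters);      # lowercase before uppercase
my @upper_first_order = $upper_first->sort(@letters);  # uppercase before lowercase

print "default:            @lower_first_order\n";
print "upper_before_lower: @upper_first_order\n";
```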
Methods for Collation
"@sorted = $Collator->sort(@not_sorted)"
Sorts a list of strings.
"$result = $Collator->cmp($a, $b)"
Returns 1 (when $a is greater than $b) or 0 (when $a
is equal to $b) or -1 (when $a is less than $b).
"$result = $Collator->eq($a, $b)"
"$result = $Collator->ne($a, $b)"
"$result = $Collator->lt($a, $b)"
"$result = $Collator->le($a, $b)"
"$result = $Collator->gt($a, $b)"
"$result = $Collator->ge($a, $b)"
They work like the Perl operators of the same names.

   eq : whether $a is equal to $b.
   ne : whether $a is not equal to $b.
   lt : whether $a is less than $b.
   le : whether $a is less than $b or equal to $b.
   gt : whether $a is greater than $b.
   ge : whether $a is greater than $b or equal to $b.
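
The difference from Perl's built-in string operators can be sketched with a lowercase/uppercase pair: the built-in "cmp" compares code points, so "apple" sorts after "Banana", while the collator compares letters first.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

my $uca_result     = $Collator->cmp("apple", "Banana");  # collation order
my $builtin_result = "apple" cmp "Banana";               # code point order

my $is_less       = $Collator->lt("apple", "Banana");
my $is_less_or_eq = $Collator->le("apple", "apple");

print "Collator cmp: $uca_result\n";
print "built-in cmp: $builtin_result\n";
```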
"$sortKey = $Collator->getSortKey($string)"
-- see 4.3 Form a sort key for each string, UTR #10.
Returns a sort key.
You can compare the sort keys using a binary comparison
and get the result of the comparison of the strings
using UCA.

   $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

is equivalent to

   $Collator->cmp($a, $b)
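
Because sort keys can be compared with the built-in "cmp", they may be computed once per string and reused, for instance when the same list is compared many times. A minimal sketch, with a made-up word list:

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new();
my @words    = qw(pear Apple banana apple Pear);

# Compute each sort key once, then sort by binary comparison of the keys.
my %key_for = map { $_ => $Collator->getSortKey($_) } @words;
my @by_key  = sort { $key_for{$a} cmp $key_for{$b} } @words;

# The sort() method is expected to give the same order.
my @by_method = $Collator->sort(@words);

print "by key:    @by_key\n";
print "by method: @by_method\n";
```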
"$sortKeyForm = $Collator->viewSortKey($string)"
Returns a string formatted to display a sort key.
Weights are enclosed with '[' and ']' and level boundaries
are denoted by '|'.

   use Unicode::Collate;
   my $c = Unicode::Collate->new();
   print $c->viewSortKey("Perl"), "\n";

   # output:
   # [09B3 08B1 09CB 094F|0020 0020 0020 0020|0008 0002 0002 0002|FFFF FFFF FFFF FFFF]
   #  Level 1             Level 2             Level 3             Level 4
"$position = $Collator->index($string, $substring)"
"($position, $length) = $Collator->index($string, $substring)"
-- see 6.8 Searching, UTR #10.
If $substring matches a part of $string, returns the
position of the first occurrence of the matching part
in scalar context; in list context, returns a two-element
list of the position and the length of the matching
part.
Notice that the length of the matching part may differ from the length of $substring.
Note that the position and the length are counted on
the string after preprocessing, normalization, and
rearrangement. Therefore, if the specified string is
not binary equal to the preprocessed/normalized/rearranged
string, the position and the length may differ from
those on the specified string. But it is guaranteed
that, if matched, it returns a non-negative value as
$position.
If $substring does not match any part of $string,
returns "-1" in scalar context and an empty list in
list context.
e.g. you say

   my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
   my $str = "Ich mu\x{DF} studieren.";
   my $sub = "m\x{FC}ss";
   my $match;
   if (my($pos,$len) = $Collator->index($str, $sub)) {
       $match = substr($str, $pos, $len);
   }

and get "mu\x{DF}" in $match since "muß" is primary equal to "müss".
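
The two calling contexts and the no-match case might be exercised as in the sketch below. It reuses the strings of the example above; "xyz" is an arbitrary substring chosen because it does not occur in the text. Since index() is described as experimental under BUGS, treat the results as indicative only.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $Collator = Unicode::Collate->new(normalization => undef, level => 1);
my $str      = "Ich mu\x{DF} studieren.";

# List context: position and length of the matching part.
my ($pos, $len) = $Collator->index($str, "m\x{FC}ss");
my $match = defined $pos ? substr($str, $pos, $len) : undef;

# Scalar context, no match: -1.
my $missing = $Collator->index($str, "xyz");

# List context, no match: an empty list.
my @none = $Collator->index($str, "xyz");

print "matched ", length($match), " characters at position $pos\n" if defined $match;
print "scalar result for a miss: $missing\n";
print "list result for a miss has ", scalar(@none), " elements\n";
```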
Other Methods
UCA_Version
Returns the version number of Unicode Technical Standard
10 this module consults.
Base_Unicode_Version
Returns the version number of the Unicode Standard
this module is based on.
EXPORT
None by default.
TODO
Unicode::Collate has not been ported to EBCDIC. The code
would mostly work just fine, but a decision needs to be
made: how should the module work in EBCDIC? Should the
low 256 characters be understood as Unicode or as EBCDIC
code points? Should one be chosen or should there be a
way to do either? Or should such translation be left
outside the module for the user to do, for example by using
Encode::from_to()? (or utf8::unicode_to_native()/utf8::native_to_unicode()?)
CAVEAT
Use of the "normalization" parameter requires the
Unicode::Normalize module.
If you do not need it (say, when you do not need to
handle any combining characters), assign "normalization =>
undef" explicitly.
-- see 6.5 Avoiding Normalization, UTR #10.
BUGS
"index()" is an experimental method and its return value
may be unreliable. The correct implementation for
"index()" must be based on Locale-Sensitive Support: Level
3 in UTR #18, Unicode Regular Expression Guidelines.
See also 4.2 Locale-Dependent Graphemes in UTR #18.

AUTHOR

SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2001-2002, SADAHIRO Tomoyuki. Japan. All
rights reserved.
This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

SEE ALSO

http://www.unicode.org/unicode/reports/tr10/
Unicode Collation Algorithm - UTR #10
http://www.unicode.org/unicode/reports/tr10/allkeys.txt
The Default Unicode Collation Element Table
http://www.unicode.org/unicode/reports/tr15/
Unicode Normalization Forms - UAX #15
http://www.unicode.org/unicode/reports/tr18
Unicode Regular Expression Guidelines - UTR #18
Unicode::Normalize