Unicode::Collate(3)
NAME
Unicode::Collate - Unicode Collation Algorithm
SYNOPSIS
    use Unicode::Collate;

    # construct
    $Collator = Unicode::Collate->new(%tailoring);

    # sort
    @sorted = $Collator->sort(@not_sorted);

    # compare
    $result = $Collator->cmp($a, $b); # returns 1, 0, or -1.
DESCRIPTION
Constructor and Tailoring
- The "new" method returns a collator object.
- $Collator = Unicode::Collate->new(
      alternate => $alternate,
      backwards => $levelNumber, # or \@levelNumbers
      entry => $element,
      normalization => $normalization_form,
      ignoreName => qr/$ignoreName/,
      ignoreChar => qr/$ignoreChar/,
      katakana_before_hiragana => $bool,
      level => $collationLevel,
      overrideCJK => \&overrideCJK,
      overrideHangul => \&overrideHangul,
      preprocess => \&preprocess,
      rearrange => \@charList,
      table => $filename,
      undefName => qr/$undefName/,
      undefChar => qr/$undefChar/,
      upper_before_lower => $bool,
  );
  # if %tailoring is false (i.e. empty),
  # $Collator should do the default collation.
- alternate
- -- see 3.2.2 Alternate Weighting, UTR #10.
- This key selects the alternate weighting for variable
collation elements, which are marked with an ASTERISK
in the table (NOTE: many punctuation marks and symbols
are variable in allkeys.txt).
    alternate => 'blanked', 'non-ignorable', 'shifted', or 'shift-trimmed'.
- These names are case-insensitive. By default (if the
specification is omitted), 'shifted' is adopted.
    'Blanked'        Variable elements are ignorable at levels 1 through 3;
                     considered at the 4th level.
    'Non-ignorable'  Variable elements are not reset to ignorable.
    'Shifted'        Variable elements are ignorable at levels 1 through 3;
                     their level 4 weight is replaced by the old level 1 weight.
                     Level 4 weight for Non-Variable elements is 0xFFFF.
    'Shift-Trimmed'  Same as 'shifted', but all FFFF's at the 4th level
                     are trimmed.
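The effect of the alternate setting can be sketched with two collators that differ only in this key. This is a minimal illustration, not part of the original manual: it assumes the default allkeys.txt table, in which SPACE is a variable element, and it restricts comparison to level 3 so that the level 4 tie-breaker does not hide the difference. The sample strings are arbitrary.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Compare at levels 1..3 only, so the level-4 weight does not take part.
my $shifted = Unicode::Collate->new(alternate => 'shifted',       level => 3);
my $nonign  = Unicode::Collate->new(alternate => 'non-ignorable', level => 3);

# Assumption: SPACE is a variable element in the table in use.
my $shifted_equal = $shifted->eq("de luge", "deluge") ? 1 : 0;
my $nonign_equal  = $nonign->eq("de luge", "deluge")  ? 1 : 0;

print "shifted, level 3:       ", ($shifted_equal ? "equal" : "different"), "\n";
print "non-ignorable, level 3: ", ($nonign_equal  ? "equal" : "different"), "\n";
```

With 'shifted' the space is ignorable at levels 1 through 3, so the two strings should compare equal; with 'non-ignorable' the space keeps its primary weight and the strings should differ.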
- backwards
- -- see 3.1.2 French Accents, UTR #10.
    backwards => $levelNumber or \@levelNumbers
- Weights are compared in reverse order at the given level(s);
ex. level 2 (diacritic ordering) in French. If omitted, all
levels are compared forwards.
- entry
- -- see 3.1 Linguistic Features; 3.2.1 File Format, UTR #10.
- Overrides a default order or defines additional collation
elements.
    entry => <<'ENTRIES', # use the UCA file format
    00E6 ; [.0861.0020.0002.00E6] [.08B1.0020.0002.00E6] # ligature <ae> as <a><e>
    0063 0068 ; [.0893.0020.0002.0063]      # "ch" in traditional Spanish
    0043 0068 ; [.0893.0020.0008.0043]      # "Ch" in traditional Spanish
    ENTRIES
- ignoreName
- ignoreChar
- -- see Completely Ignorable, 3.2.2 Alternate Weighting, UTR #10.
- Makes the entry in the table ignorable. If a collation
element is ignorable, it is ignored as if the
element had been deleted from the string.
- E.g. when 'a' and 'e' are ignorable, 'element' is
equal to 'lament' (or 'lmnt').
- level
- -- see 4.3 Form a sort key for each string, UTR #10.
- Sets the maximum level. Any levels higher than the
specified one are ignored.
    Level 1: alphabetic ordering
    Level 2: diacritic ordering
    Level 3: case ordering
    Level 4: tie-breaking (e.g. when alternate is 'shifted')

    ex. level => 2,
- If omitted, the maximum is the 4th.
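The levels listed above can be tried out with a few collators that differ only in the level key. This is a minimal sketch assuming the default allkeys.txt table; the sample strings and variable names are chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $primary   = Unicode::Collate->new(level => 1);
my $secondary = Unicode::Collate->new(level => 2);
my $tertiary  = Unicode::Collate->new(level => 3);

my $plain    = "resume";
my $accented = "r\x{E9}sum\x{E9}";   # "resume" with two E-ACUTE letters

# Level 1 looks at base letters only: accents and case are ignored.
my $l1_accent = $primary->eq($plain, $accented)   ? 1 : 0;
my $l1_case   = $primary->eq("perl", "PERL")      ? 1 : 0;

# Level 2 adds diacritics, but case is still ignored.
my $l2_accent = $secondary->eq($plain, $accented) ? 1 : 0;
my $l2_case   = $secondary->eq("perl", "PERL")    ? 1 : 0;

# Level 3 adds case.
my $l3_case   = $tertiary->eq("perl", "PERL")     ? 1 : 0;

print "level 1: accents equal=$l1_accent, case equal=$l1_case\n";
print "level 2: accents equal=$l2_accent, case equal=$l2_case\n";
print "level 3: case equal=$l3_case\n";
```

A collator limited to level 1 is a common choice for accent-insensitive and case-insensitive matching, while level 2 keeps accents significant but still folds case.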
- normalization
- -- see 4.1 Normalize each input string, UTR #10.
- If specified, strings are normalized before the preparation
of sort keys (the normalization is executed after
preprocess).
- The form name must be one of the following:
    'C' or 'NFC' for Normalization Form C
    'D' or 'NFD' for Normalization Form D
    'KC' or 'NFKC' for Normalization Form KC
    'KD' or 'NFKD' for Normalization Form KD
- If omitted, the string is put into Normalization Form D.
- If "undef" is passed explicitly as the value for this
key, no normalization is carried out (this may
make tailoring easier if normalization is not
desired).
- see CAVEAT.
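One way to see what normalization contributes is to compare two canonically equivalent strings whose combining marks are stored in a different order. The following is a minimal sketch, assuming the default allkeys.txt table and an installed Unicode::Normalize; the two strings are arbitrary examples.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Canonically equivalent spellings that differ only in mark order:
# U+0307 COMBINING DOT ABOVE and U+0323 COMBINING DOT BELOW.
my $above_first = "a\x{0307}\x{0323}";
my $below_first = "a\x{0323}\x{0307}";

my $with_nfd = Unicode::Collate->new();                        # default: NFD
my $without  = Unicode::Collate->new(normalization => undef);  # no normalization

# NFD puts the marks into canonical order before sort keys are made.
my $equal_with_nfd = $with_nfd->eq($above_first, $below_first) ? 1 : 0;

# Without normalization the level-2 weights appear in a different order.
my $equal_without  = $without->eq($above_first, $below_first)  ? 1 : 0;

print "with NFD:              ", ($equal_with_nfd ? "equal" : "different"), "\n";
print "normalization => undef: ", ($equal_without ? "equal" : "different"), "\n";
```

If the input is known to be normalized already, or contains no combining characters, turning normalization off avoids this extra pass.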
- overrideCJK
- -- see 7.1 Derived Collation Elements, UTR #10.
- By default, the mapping of CJK Unified Ideographs uses the
Unicode codepoint order, but this mapping may be overridden.
- ex. CJK Unified Ideographs in the JIS code point order.
    overrideCJK => sub {
        my $u = shift;             # get a Unicode codepoint
        my $b = pack('n', $u);     # to UTF-16BE
        my $s = your_unicode_to_sjis_converter($b); # convert
        my $n = unpack('n', $s);   # convert sjis to short
        [ $n, 0x20, 0x2, $u ];     # return the collation element
    },
- ex. ignores all CJK Unified Ideographs.
    overrideCJK => sub {()}, # CODEREF returning empty list

    # where ->eq("Pe\x{4E00}rl", "Perl") is true
    # as U+4E00 is a CJK Unified Ideograph and is to be ignorable.
- If "undef" is passed explicitly as the value for this
key, weights for CJK Unified Ideographs are treated as
undefined. But assignment of weight for CJK Unified
Ideographs in table or entry is still valid.
- overrideHangul
- -- see 7.1 Derived Collation Elements, UTR #10.
- By default, Hangul Syllables are decomposed into
Hangul Jamo, but the mapping of Hangul Syllables may
be overridden.
- This tag works like overrideCJK, so see there for
examples.
- If you want to override the mapping of Hangul Syllables,
the Normalization Forms D and KD are not appropriate
(the syllables will be decomposed before overriding).
- If "undef" is passed explicitly as the value for this
key, the weight for Hangul Syllables is treated as undefined
without decomposition into Hangul Jamo. But a
definition of weight for Hangul Syllables in table or
entry is still valid.
- preprocess
- -- see 5.1 Preprocessing, UTR #10.
- If specified, the coderef is used to preprocess strings before
the formation of sort keys.
- ex. dropping English articles, such as "a" or "the".
Then, "the pen" is before "a pencil".
    preprocess => sub {
        my $str = shift;
        $str =~ s/\b(?:an?|the)\s+//gi;
        $str;
    },
- rearrange
- -- see 3.1.3 Rearrangement, UTR #10.
- Characters that are not coded in logical order and are to
be rearranged. By default,
    rearrange => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4 ],
- If you want to disallow any rearrangement, pass
"undef" or "[]" (a reference to an empty list) as the
value for this key.
- table
- -- see 3.2 Default Unicode Collation Element Table,
UTR #10.
- You can use another element table if desired. The
table file must be in your "lib/Unicode/Collate"
directory.
- By default, the file "lib/Unicode/Collate/allkeys.txt"
is used.
- If "undef" is passed explicitly as the value for this
key, no file is read (but you can define collation
elements via entry).
- A typical way to define a collation element table
without any table file:
    $onlyABC = Unicode::Collate->new(
        table => undef,
        entry => << 'ENTRIES',
    0061 ; [.0101.0020.0002.0061] # LATIN SMALL LETTER A
    0041 ; [.0101.0020.0008.0041] # LATIN CAPITAL LETTER A
    0062 ; [.0102.0020.0002.0062] # LATIN SMALL LETTER B
    0042 ; [.0102.0020.0008.0042] # LATIN CAPITAL LETTER B
    0063 ; [.0103.0020.0002.0063] # LATIN SMALL LETTER C
    0043 ; [.0103.0020.0008.0043] # LATIN CAPITAL LETTER C
    ENTRIES
    );
- undefName
- undefChar
- -- see 6.3.4 Reducing the Repertoire, UTR #10.
- Undefines the collation element as if it were unassigned
in the table. This reduces the size of the
table. If an unassigned character appears in the
string to be collated, the sort key is made from its
codepoint as a single-character collation element, as
it is greater than any other assigned collation
elements (in the codepoint order among the unassigned
characters). But, it'd be better to ignore characters
unfamiliar to you and maybe never used.
- katakana_before_hiragana
- upper_before_lower
- -- see 6.6 Case Comparisons; 7.3.1 Tertiary Weight
Table, UTR #10.
- By default, lowercase is before uppercase and hiragana
is before katakana.
- If the tag is made true, this is reversed.
- NOTE: These tags simplemindedly assume that any
lowercase/uppercase or hiragana/katakana distinctions
occur at level 3, and that their weights at level 3
are the same as those mentioned in 7.3.1, UTR #10.
If you define collation elements that violate
this assumption, these tags do not work correctly.
- Methods for Collation
- "@sorted = $Collator->sort(@not_sorted)"
- Sorts a list of strings.
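The sort method honours whatever tailoring the collator was constructed with. The sketch below, which assumes the default allkeys.txt table, contrasts Perl's built-in sort with the default collator and with collators using the upper_before_lower and backwards keys described earlier. The word lists and variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @letters = qw(b A a B);

# Perl's built-in sort compares code points, so capitals come first.
my @by_codepoint = sort @letters;

# Default collation: alphabetic order first, lowercase before uppercase.
my @by_uca = Unicode::Collate->new()->sort(@letters);

# upper_before_lower reverses the level-3 case order.
my @upper_first =
    Unicode::Collate->new(upper_before_lower => 1)->sort(@letters);

# Four French words that differ only in their accents:
# cote, cot + E-ACUTE, c + O-CIRCUMFLEX + te, and the form with both.
my @words = ("cote", "cot\x{E9}", "c\x{F4}te", "c\x{F4}t\x{E9}");

# Level 2 compared forwards (default) and backwards (French style).
my @forwards = Unicode::Collate->new()->sort(reverse @words);
my @french   = Unicode::Collate->new(backwards => 2)->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "code point order:   @by_codepoint\n";
print "default collation:  @by_uca\n";
print "upper_before_lower: @upper_first\n";
print "forwards level 2:   @forwards\n";
print "backwards level 2:  @french\n";
```

Under these assumptions the default collator should give "a A b B" where the built-in sort gives "A B a b", and the backwards collator should place the word with only a circumflex before the word with only an acute, because the last accent difference in the word decides.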
- "$result = $Collator->cmp($a, $b)"
- Returns 1 (when $a is greater than $b), 0 (when $a
is equal to $b), or -1 (when $a is less than $b).
- "$result = $Collator->eq($a, $b)"
- "$result = $Collator->ne($a, $b)"
- "$result = $Collator->lt($a, $b)"
- "$result = $Collator->le($a, $b)"
- "$result = $Collator->gt($a, $b)"
- "$result = $Collator->ge($a, $b)"
- They work like the operators of the same names.
    eq : whether $a is equal to $b.
    ne : whether $a is not equal to $b.
    lt : whether $a is less than $b.
    le : whether $a is less than $b or equal to $b.
    gt : whether $a is greater than $b.
    ge : whether $a is greater than $b or equal to $b.
- "$sortKey = $Collator->getSortKey($string)"
- -- see 4.3 Form a sort key for each string, UTR #10.
- Returns a sort key.
- You compare the sort keys using a binary comparison
and get the result of the comparison of the strings
using UCA.
    $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

      is equivalent to

    $Collator->cmp($a, $b)
- "$sortKeyForm = $Collator->viewSortKey($string)"
- Returns a string formatted to display a sort key.
Weights are enclosed in '[' and ']', and level boundaries
are denoted by '|'.
    use Unicode::Collate;
    my $c = Unicode::Collate->new();
    print $c->viewSortKey("Perl"), "\n";

    # output:
    # [09B3 08B1 09CB 094F|0020 0020 0020 0020|0008 0002 0002 0002|FFFF FFFF FFFF FFFF]
    #  Level 1             Level 2             Level 3             Level 4
- "$position = $Collator->index($string, $substring)"
- "($position, $length) = $Collator->index($string, $substring)"
- -- see 6.8 Searching, UTR #10.
- If $substring matches a part of $string, returns the
position of the first occurrence of the matching part
in scalar context; in list context, returns a two-element
list of the position and the length of the matching
part.
- Notice that the length of the matching part may differ from the length of $substring.
- Note that the position and the length are counted on
the string after preprocessing, normalization,
and rearrangement. Therefore, if the specified
string is not binary equal to the
preprocessed/normalized/rearranged string, the position and
the length may differ from those on the specified
string. But it is guaranteed that, if matched, it
returns a non-negative value as $position.
- If $substring does not match any part of $string,
returns "-1" in scalar context and an empty list in
list context.
- e.g. you say
    my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
    my $str = "Ich mu\x{00DF} studieren.";
    my $sub = "m\x{00FC}ss";
    my $match;
    if (my($pos,$len) = $Collator->index($str, $sub)) {
        $match = substr($str, $pos, $len);
    }
- and get "mu\x{00DF}" in $match since "muß" is primary
equal to "müss".
- Other Methods
- UCA_Version
- Returns the version number of Unicode Technical Standard
10 this module consults.
- Base_Unicode_Version
- Returns the version number of the Unicode Standard
this module is based on.
- EXPORT
- None by default.
- TODO
- Unicode::Collate has not been ported to EBCDIC. The code
mostly would work just fine but a decision needs to be
made: how should the module work in EBCDIC? Should the
low 256 characters be understood as Unicode or as EBCDIC
code points? Should one be chosen or should there be a
way to do either? Or should such translation be left
outside the module for the user to do, for example by using
Encode::from_to()? (or utf8::unicode_to_native()/utf8::native_to_unicode()?)
- CAVEAT
- Use of the "normalization" parameter requires the
Unicode::Normalize module.
- If you do not need it (say, when you need not
handle any combining characters), assign "normalization =>
undef" explicitly.
- -- see 6.5 Avoiding Normalization, UTR #10.
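One possible way to act on this caveat is to decide at run time whether to keep the default normalization. The sketch below is only an illustration and not something the module does itself: the $have_normalize flag and the %tailoring hash are hypothetical names, and the fallback is reasonable only when the data contains no combining characters.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Check at run time whether Unicode::Normalize can be loaded.
my $have_normalize = eval { require Unicode::Normalize; 1 } ? 1 : 0;

# Keep the default (NFD) when the module is present; otherwise
# switch normalization off explicitly, as suggested above.
my %tailoring = $have_normalize ? () : (normalization => undef);
my $Collator  = Unicode::Collate->new(%tailoring);

my @sorted = $Collator->sort(qw(pear Apple apple));

print "Unicode::Normalize available: ", ($have_normalize ? "yes" : "no"), "\n";
print "@sorted\n";
```

For plain ASCII input such as this list, the result should be the same whether or not normalization is carried out.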
- BUGS
- "index()" is an experimental method and its return value
may be unreliable. The correct implementation for
"index()" must be based on Locale-Sensitive Support: Level
3 in UTR #18, Unicode Regular Expression Guidelines.
- See also 4.2 Locale-Dependent Graphemes in UTR #18.
AUTHOR
- SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>
- http://homepage1.nifty.com/nomenclator/perl/
- Copyright(C) 2001-2002, SADAHIRO Tomoyuki. Japan. All
rights reserved.
- This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
SEE ALSO
- http://www.unicode.org/unicode/reports/tr10/
- Unicode Collation Algorithm - UTR #10
- http://www.unicode.org/unicode/reports/tr10/allkeys.txt
- The Default Unicode Collation Element Table
- http://www.unicode.org/unicode/reports/tr15/
- Unicode Normalization Forms - UAX #15
- http://www.unicode.org/unicode/reports/tr18
- Unicode Regular Expression Guidelines - UTR #18
- Unicode::Normalize