map8(3)
NAME
Unicode::Map8 - Mapping table between 8-bit chars and Unicode
SYNOPSIS
require Unicode::Map8;
my $no_map = Unicode::Map8->new("ISO646-NO") || die;
my $l1_map = Unicode::Map8->new("latin1") || die;
my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
my $lstr = $l1_map->to8($ustr);
print $lstr;
print $no_map->tou("V}re norske tegn b|r {res\n")->utf8;
DESCRIPTION
The Unicode::Map8 class implements efficient mapping tables
between 8-bit character sets and 16-bit character sets
like Unicode. The tables are efficient both in terms of
space allocated and translation speed. The 16-bit strings
are assumed to use network byte order.
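To make the byte-order convention concrete, the following sketch builds and decodes a 16-bit string by hand with Perl's core pack and unpack functions. The "n" template stores each value as an unsigned 16-bit integer, most significant byte first. The helper names codes_to_ucs2 and ucs2_to_codes are made up for this illustration and are not part of Unicode::Map8.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not part of Unicode::Map8.

# Pack a list of 16-bit code values into a string, two bytes per
# character, most significant byte first (network byte order).
sub codes_to_ucs2 {
    my @codes = @_;
    return pack("n*", @codes);
}

# Split such a string back into its 16-bit code values.
sub ucs2_to_codes {
    my ($ustr) = @_;
    return unpack("n*", $ustr);
}

my $ustr = codes_to_ucs2(0x0041, 0x00E5);
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $ustr)), "\n";
# prints: 00 41 00 E5
```

A string produced on a little-endian machine with the "v" template would have the two bytes of each character swapped and would not match the layout described above.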
The following methods are available:
- $m = Unicode::Map8->new( [$charset] )
- The object constructor creates new instances of the
Unicode::Map8 class. It takes an optional argument
that specifies the name of an 8-bit character set to
initialize mappings from. The argument can also be
the name of a mapping file. If the charset/file cannot
be located, then the constructor returns undef.
- If you omit the argument, then an empty mapping table
is constructed. You must then add mapping pairs to it
using the addpair() method described below.
- $m->addpair( $u8, $u16 );
- Adds a new mapping pair to the mapping object. It
takes two arguments. The first is the code value in
the 8-bit character set and the second is the corresponding
code value in the 16-bit character set. The
same codes can be used multiple times (but using the
same pair has no effect). The first definition for a
code is the one that is used.
- Consider the following example:
$m->addpair(0x20, 0x0020);
$m->addpair(0x20, 0x00A0);
$m->addpair(0xA0, 0x00A0);
- It means that the characters 0x20 and 0xA0 in the 8-bit
charset map to themselves in the 16-bit set, but in
the 16-bit character set 0x00A0 maps to 0x20.
- $m->default_to8( $u8 )
- Set the code of the default character to use when mapping
from 16-bit to 8-bit strings. If there is no
mapping pair defined for a character then this default
is substituted by to8() and recode8().
- $m->default_to16( $u16 )
- Set the code of the default character to use when mapping
from 8-bit to 16-bit strings. If there is no mapping
pair defined for a character then this default is
used by to16(), tou() and recode8().
- $m->nostrict;
- All undefined mappings are replaced with the identity
mapping. Undefined characters are normally just
removed (or replaced with the default if defined) when
converting between character sets.
- $m->to8( $ustr );
- Converts a 16-bit character string to the corresponding
string in the 8-bit character set.
- $m->to16( $str );
- Converts an 8-bit character string to the corresponding
string in the 16-bit character set.
- $m->tou( $str );
- Same as to16() but returns a Unicode::String object
instead of a plain UCS2 string.
- $m->recode8($m2, $str);
- Maps the string $str from one 8-bit character set ($m)
to another one ($m2). Since the mappings of both sets
towards the common 16-bit encoding are known, this can
be used to convert between any of the 8-bit character
sets.
- $m->to_char16( $u8 )
- Maps a single 8-bit character code to a 16-bit code.
If the 8-bit character is unmapped then the constant
NOCHAR is returned. The default is not used and the
callback method is not invoked.
- $m->to_char8( $u16 )
- Maps a single 16-bit character code to an 8-bit code.
If the 16-bit character is unmapped then the constant
NOCHAR is returned. The default is not used and the
callback method is not invoked.
The following callback methods are available. You can
override these methods by creating a subclass of
Unicode::Map8.
- $m->unmapped_to8
- When mapping to an 8-bit character string and there is no
mapping defined (and no default either), then this
method is called as the last resort. It is called
with a single integer argument which is the code of
the unmapped 16-bit character. It is expected to
return a string that will be incorporated in the 8-bit
string. The default version of this method always
returns an empty string.
- Example:
package MyMapper;
@ISA=qw(Unicode::Map8);
sub unmapped_to8
{
    my($self, $code) = @_;
    require Unicode::CharName;
    "<" . Unicode::CharName::uname($code) . ">";
}
- $m->unmapped_to16
- Likewise, when mapping to a 16-bit character string and
no mapping is defined, this method is called. It
should return a 16-bit string with the bytes in network
byte order. The default version of this method
always returns an empty string.
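The first-definition-wins rule of addpair() and the effect of a default character can be modeled with two plain hashes. This is only a toy model written from the descriptions above, not the implementation used by Unicode::Map8; the names new_model, add_pair and translate are hypothetical.

```perl
use strict;
use warnings;

# Toy model only; none of these names belong to Unicode::Map8.
# Two hashes stand in for the real mapping tables.
sub new_model {
    my %model = (to16 => {}, to8 => {});
    return \%model;
}

# Record a pair in both directions. A code that already has a
# definition keeps it, so the first definition is the one used.
sub add_pair {
    my ($m, $u8, $u16) = @_;
    $m->{to16}{$u8} = $u16 unless exists $m->{to16}{$u8};
    $m->{to8}{$u16} = $u8  unless exists $m->{to8}{$u16};
    return;
}

# Translate a list of codes through one table. An unmapped code is
# replaced by $default when one is given and dropped otherwise.
sub translate {
    my ($table, $default, @codes) = @_;
    my @out;
    for my $c (@codes) {
        if    (exists $table->{$c}) { push @out, $table->{$c} }
        elsif (defined $default)    { push @out, $default }
    }
    return @out;
}

my $m = new_model();
add_pair($m, 0x20, 0x0020);
add_pair($m, 0x20, 0x00A0);   # adds only 0x00A0 -> 0x20
add_pair($m, 0xA0, 0x00A0);   # adds only 0xA0 -> 0x00A0

my @u16  = translate($m->{to16}, undef,    0x20, 0xA0);
my @u8   = translate($m->{to8},  undef,    0x0020, 0x00A0);
my @dflt = translate($m->{to8},  ord("?"), 0x0041);
my @gone = translate($m->{to8},  undef,    0x0041);

printf "8 to 16: %s\n", join(" ", map { sprintf "0x%04X", $_ } @u16);
printf "16 to 8: %s\n", join(" ", map { sprintf "0x%02X", $_ } @u8);
printf "default: %s\n", join(" ", map { sprintf "0x%02X", $_ } @dflt);
printf "dropped: %d codes kept\n", scalar @gone;
```

In this model both 0x20 and 0xA0 map to themselves going from 8 to 16 bits, while 0x00A0 maps back to 0x20, which matches the addpair() example. The last two calls show an unmapped code being replaced by a default and being removed when no default is set.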
FILES
The Unicode::Map8 constructor can parse two different file
formats; a binary format and a textual format.
The binary format is simple. It consists of a sequence of
16-bit integer pairs in network byte order. The first
pair should contain the magic value 0xFFFE, 0x0001. Of
each pair, the first value is the code of an 8-bit character
and the second is the code of the 16-bit character.
It follows from this that the first value should be less
than 256.
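The layout just described can be written and read back with pack and unpack. The sketch below follows only the description in this section; the helper names encode_pairs and decode_pairs are hypothetical, and treating a missing magic pair as a fatal error is a choice made for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not part of Unicode::Map8.

# Serialize [u8, u16] pairs: the magic pair first, then one pair per
# mapping, every value as a 16-bit integer in network byte order.
sub encode_pairs {
    my @pairs = @_;
    return pack("n*", 0xFFFE, 0x0001, map { @$_ } @pairs);
}

# Parse the same layout back into a list of [u8, u16] pairs.
sub decode_pairs {
    my ($data) = @_;
    die "truncated data\n" if length($data) % 4;   # each pair is 4 bytes
    my @vals = unpack("n*", $data);
    my ($magic1, $magic2) = splice(@vals, 0, 2);
    die "bad magic\n"
        unless defined $magic2 && $magic1 == 0xFFFE && $magic2 == 0x0001;
    my @pairs;
    while (my ($u8, $u16) = splice(@vals, 0, 2)) {
        die sprintf("8-bit code 0x%04X out of range\n", $u8) if $u8 > 255;
        push @pairs, [$u8, $u16];
    }
    return @pairs;
}

my $bin = encode_pairs([0x20, 0x0020], [0xA0, 0x00A0]);
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $bin)), "\n";
# prints: FF FE 00 01 00 20 00 20 00 A0 00 A0
```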
The textual format consists of lines, each of which is
either a comment (first non-blank character is '#'), a
completely blank line or a line with two hexadecimal
numbers. The hexadecimal numbers must be preceded by "0x"
as in C and Perl. This is the same format used by the
Unicode mapping files available from
<URL:ftp://ftp.unicode.org/Public>.
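A reader for this textual format needs little more than one regular expression. The following is a minimal sketch, not the parser used by the module; parse_text_map is a hypothetical name, and ignoring any text after the second number (such as a trailing comment) is an assumption of the example.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of Unicode::Map8.
# Returns a list of [u8, u16] pairs parsed from the given lines.
sub parse_text_map {
    my @lines = @_;
    my @pairs;
    for my $line (@lines) {
        next if $line =~ /^\s*$/;   # completely blank line
        next if $line =~ /^\s*#/;   # comment line
        if ($line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/) {
            push @pairs, [hex($1), hex($2)];
        }
        else {
            warn "skipping unrecognized line: $line\n";
        }
    }
    return @pairs;
}

my @pairs = parse_text_map(
    "# 8-bit code, 16-bit code",
    "",
    "0x20\t0x0020",
    "0xA0\t0x00A0\t# NO-BREAK SPACE",
);
printf "0x%02X -> 0x%04X\n", @$_ for @pairs;
```

The sample input yields two pairs, 0x20 to 0x0020 and 0xA0 to 0x00A0; the comment line and the empty line are skipped.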
The mapping table files are installed in the
Unicode/Map8/maps directory somewhere in the Perl @INC path.
The variable $Unicode::Map8::MAPS_DIR is the complete path
name to this directory. Binary mapping files are stored
within this directory with the suffix .bin. Textual mapping
files are stored with the suffix .txt.
The scripts map8_bin2txt and map8_txt2bin can translate
between these mapping file formats.
A special file called aliases within $MAPS_DIR specifies all
the alias names that can be used to denote the various
character sets. The first name on each line is the real
file name and the rest are alias names separated by spaces.
The "umap --list" command can be used to list the character
sets supported.
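The line layout of the aliases file maps naturally onto a hash from each name to the real file name. The sketch below assumes only what is stated above (real name first, aliases after it, separated by whitespace); parse_aliases is a hypothetical helper and the sample lines are invented for the example rather than copied from the installed file.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of Unicode::Map8.
# Builds a lookup table from every name on a line to the real name.
sub parse_aliases {
    my @lines = @_;
    my %real_name;
    for my $line (@lines) {
        my ($real, @aliases) = split ' ', $line;
        next unless defined $real;   # skip empty lines
        $real_name{$_} = $real for ($real, @aliases);
    }
    return %real_name;
}

# Invented sample content, one character set per line.
my %real_name = parse_aliases(
    "ISO_8859-1 latin1 l1",
    "",
    "ISO646-NO no",
);
print "latin1 is stored as $real_name{latin1}\n";
```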
BUGS
Does not handle Unicode surrogate pairs as a single
character.
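To see what a surrogate pair looks like in a 16-bit string, the following sketch encodes one character outside the 16-bit range with the core Encode module and splits the result into 16-bit values. It illustrates the limitation only and does not use Unicode::Map8; a table that maps single 16-bit values would see the two halves as separate codes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+1F600 lies outside the 16-bit range, so UTF-16 stores it as a
# surrogate pair: two 16-bit values in network byte order.
my @units = unpack("n*", encode("UTF-16BE", "\x{1F600}"));
printf "%d units: %s\n", scalar(@units),
    join(" ", map { sprintf "0x%04X", $_ } @units);
# prints: 2 units: 0xD83D 0xDE00
```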
SEE ALSO
umap(1), Unicode::String
COPYRIGHT
Copyright 1998 Gisle Aas.
This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.