map(3)

NAME

Unicode::Map V0.112 - maps charsets from and to UTF-16 Unicode

SYNOPSIS

use Unicode::Map();
$Map = new Unicode::Map("ISO-8859-1");
$utf16 = $Map -> to_unicode ("Hello world!");
  => $utf16 == " H e l l o   w o r l d !"
$locale = $Map -> from_unicode ($utf16);
  => $locale == "Hello world!"
A more detailed description below.
2do: short note about perl's Unicode perspectives.
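
To make the spaced-out notation in the synopsis concrete: each gap stands for a zero byte, because every ISO-8859-1 character becomes one 2-byte code unit with the high byte first. The sketch below reproduces that layout with core Perl only. It does not call Unicode::Map, and the helper names latin1_to_utf16be and utf16be_to_latin1 are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Unicode::Map: widen each ISO-8859-1
# byte to a 16-bit big-endian code unit ("n" = unsigned 16-bit, big-endian).
sub latin1_to_utf16be {
    my ($bytes) = @_;
    return pack("n*", unpack("C*", $bytes));
}

# Inverse direction; only meaningful while every code unit is below 0x100.
sub utf16be_to_latin1 {
    my ($utf16) = @_;
    return pack("C*", unpack("n*", $utf16));
}

my $utf16 = latin1_to_utf16be("Hello world!");
printf "%d bytes: %s\n", length($utf16),
    join(" ", map { sprintf "%02x", $_ } unpack("C*", $utf16));
```

This only works for ISO-8859-1, whose codes coincide with the first 256 Unicode code points; any other charset needs a real mapping table, which is what the module provides.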

DESCRIPTION

This module converts strings to and from the 2-byte Unicode
UCS-2 format. All mappings go through 2-byte UTF-16 encodings,
not through the variable-length UTF-8 encoding. To convert
between these, use Unicode::String.

For historical reasons this module coexists with
Unicode::Map8. Please use Unicode::Map8 unless you need to
handle two-byte character sets, e.g. Chinese GB2312.
In any case, if you stick to the basic functionality (see
documentation) you can use both modules interchangeably.

In practice this module will become obsolete sooner or
later, as Unicode mapping support needs to find its way
into perl's core. If you would like to work in this area,
please don't hesitate to contact Gisle Aas!

This module can't deal directly with utf8. Use
Unicode::String to convert utf8 to utf16 and vice versa.
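
As a sketch of the UTF-8 step that has to happen outside Unicode::Map: the example below uses Perl's core Encode module as a stand-in, not the Unicode::String API named above, and it assumes big-endian UTF-16 without a byte order mark. The helper names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, UTF-16 (big-endian, no BOM) bytes out.
sub utf8_to_utf16be {
    my ($utf8_bytes) = @_;
    my $chars = decode("UTF-8", $utf8_bytes);   # bytes -> Perl characters
    return encode("UTF-16BE", $chars);          # characters -> 2-byte units
}

# Hypothetical helper: the reverse conversion.
sub utf16be_to_utf8 {
    my ($utf16_bytes) = @_;
    return encode("UTF-8", decode("UTF-16BE", $utf16_bytes));
}

my $utf8  = "Gr\xc3\xbc\xc3\x9fe";              # "Gruesse" with u-umlaut and sharp s, as UTF-8 bytes
my $utf16 = utf8_to_utf16be($utf8);
printf "%d UTF-8 bytes become %d UTF-16 bytes\n", length($utf8), length($utf16);
```

The resulting string can then be handed to from_unicode, provided the byte order matches what the module expects.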

Character mapping follows the data in the binary map
files of the Unicode::Map hierarchy. Binary mapfiles can also
be created with this module, enabling you to install your own
character sets. Refer to mkmapfile or the file REGISTRY
in the Unicode::Map hierarchy.

CONVERSION METHODS

Probably these are the only methods you will need from
this module. Their usage is compatible with Unicode::Map8.

new
$Map = new Unicode::Map("GB2312-80")
Returns a new Map object for GB2312-80 encoding.
from_unicode
$dest = $Map -> from_unicode ($src)
Creates a string in locale charset representation from
utf16 encoded string $src.
to_unicode
$dest = $Map -> to_unicode ($src)
Creates a string in utf16 representation from $src.
to8
Alias for from_unicode. For compatibility with Unicode::Map8.
to16
Alias for to_unicode. For compatibility with Unicode::Map8.
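
The conversion methods above can be combined into a round trip. This is a minimal sketch: the wrapper name roundtrip is hypothetical, the check on the return value of new assumes that it is false on failure, and the module is loaded at run time so the script still compiles where Unicode::Map is not installed.

```perl
use strict;
use warnings;

# Hypothetical wrapper: convert $text from $charset to utf16 and back.
sub roundtrip {
    my ($charset, $text) = @_;
    require Unicode::Map;
    # Assumption: new returns a false value if the charset is unknown.
    my $map = Unicode::Map->new($charset)
        or die "no mapping available for $charset\n";
    my $utf16 = $map->to_unicode($text);      # to16 is an alias for this
    my $local = $map->from_unicode($utf16);   # to8 is an alias for this
    return ($utf16, $local);
}

if (eval { require Unicode::Map; 1 }) {
    my ($utf16, $local) = roundtrip("ISO-8859-1", "Hello world!");
    print length($utf16), " bytes of utf16, converted back to: $local\n";
} else {
    print "Unicode::Map is not installed; skipping the demonstration\n";
}
```

Code that sticks to new, to16 and to8 should also run against Unicode::Map8, as noted above.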

WARNINGS

You can ask Unicode::Map to issue warnings on deprecated
or incompatible usage with the constants
WARN_DEFAULT, WARN_DEPRECATION or WARN_COMPATIBILITY.
The latter two can be ORed together.
No special warnings:
$Unicode::Map::WARNINGS = Unicode::Map::WARN_DEFAULT
Warnings for deprecated usage:
$Unicode::Map::WARNINGS = Unicode::Map::WARN_DEPRECATION
Warnings for incompatible usage:
$Unicode::Map::WARNINGS = Unicode::Map::WARN_COMPATIBILITY
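
Combining both warning classes might look as follows. This is a sketch only: the function name enable_all_map_warnings is hypothetical, and calling the constants with parentheses assumes they are implemented as ordinary or constant subroutines.

```perl
use strict;
use warnings;

# Hypothetical helper: switch on deprecation and compatibility warnings.
# Returns the combined flag value, or nothing if Unicode::Map is missing.
sub enable_all_map_warnings {
    return unless eval { require Unicode::Map; 1 };
    # Parentheses are needed because the module is loaded at run time.
    my $flags = Unicode::Map::WARN_DEPRECATION()
              | Unicode::Map::WARN_COMPATIBILITY();
    no warnings 'once';
    $Unicode::Map::WARNINGS = $flags;
    return $flags;
}

my $flags = enable_all_map_warnings();
print defined $flags ? "warnings enabled\n" : "Unicode::Map is not installed\n";
```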

MAINTENANCE METHODS

Note: These methods are solely for the maintenance of
Unicode::Map. Using any of these methods will lead to
programs incompatible with Unicode::Map8.

alias
@list = $Map -> alias ($csid)
Returns a list of alias names of character set $csid.
mapping
$path = $Map -> mapping ($csid)
Returns the absolute path of binary character mapping
for character set $csid according to REGISTRY file of
Unicode::Map.
id
$real_id||"" = $Map -> id ($test_id)
Returns a valid character set identifier $real_id if
$test_id is a valid character set name or alias name
according to the REGISTRY file of Unicode::Map.
ids
@ids = $Map -> ids()
Returns a list of all character set names defined in the
REGISTRY file.
read_text_mapping
1||0 = $Map -> read_text_mapping ($csid, $path, $style)
Reads a text mapping of style $style named $csid from
file $path. The mapping can then be saved to a file with
the method write_binary_mapping. $style can be:

style       description
"unicode"   A text mapping as of ftp://ftp.unicode.org/MAPPINGS/
""          Same as "unicode"
"reverse"   Similar to "unicode", but both columns are switched
"keld"      A text mapping as of ftp://dkuug.dk/i18n/charmaps/
src
$path = $Map -> src ($csid)
Returns the path of the textual character mapping for
character set $csid according to the REGISTRY file of
Unicode::Map.
style
$style = $Map -> style ($csid)
Returns the style of the textual character mapping for
character set $csid according to the REGISTRY file of
Unicode::Map.
write_binary_mapping
1||0 = $Map -> write_binary_mapping ($csid, $path)
Stores a mapping that has been loaded via method
read_text_mapping in file $path.
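
To illustrate what the "unicode" and "reverse" styles mean, here is a sketch of a parser for the column layout used by the mapping tables at unicode.org: a charset code in hex, the Unicode code point in hex, then a comment. It is not the module's own parser. The function name parse_text_mapping and the sample lines are hypothetical, and the "keld" style is not covered.

```perl
use strict;
use warnings;

# Hypothetical parser: returns a hash ref { charset code => Unicode code }.
sub parse_text_mapping {
    my ($text, $style) = @_;
    $style = "unicode" if !defined $style || $style eq "";   # "" means "unicode"
    my %map;
    for my $line (split /\n/, $text) {
        $line =~ s/#.*//;                                     # drop comments
        my ($first, $second) =
            $line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/
            or next;                                          # blank or undefined entry
        # "reverse" has the Unicode column first.
        my ($code, $uni) = $style eq "reverse" ? ($second, $first)
                                               : ($first, $second);
        $map{ hex $code } = hex $uni;
    }
    return \%map;
}

# Made-up excerpt in "unicode" style.
my $sample = <<'END';
# charset code, Unicode code, name
0x41    0x0041    # LATIN CAPITAL LETTER A
0xA4    0x20AC    # EURO SIGN
0x80              # UNDEFINED
END

my $map = parse_text_mapping($sample, "unicode");
printf "0x%02X -> U+%04X\n", $_, $map->{$_} for sort { $a <=> $b } keys %$map;
```

With the real module, the equivalent workflow is a call to read_text_mapping followed by write_binary_mapping.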

DEPRECATED METHODS

Some functionality is no longer promoted.

noise
Deprecated! Don't use any longer.
reverse_unicode
Deprecated! Use Unicode::String::byteswap instead.

BINARY MAPPINGS

Structure of binary Mapfiles

Unicode character mapping tables contain runs of sequential
key codes and sequential value codes. This property is used
to crunch the maps easily. n (0<n<256) sequential characters
are represented as a byte count n and the first character
code key_start. For these subsequences the corresponding
value sequences are crunched together as well. The value 0
is used to start an extended information block (which is
only partially implemented, though).
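
The crunching idea can be sketched as a simple run detection. This is a simplified illustration, not the module's algorithm: it only merges entries where key and value advance together, whereas the actual format allows several value sequences per key sequence. The function name crunch_runs is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: group a { key => value } mapping into runs of
# [key_start, val_start, n] where keys and values are both sequential.
# n is capped at 255 so that it fits into the one-byte count.
sub crunch_runs {
    my ($map) = @_;
    my @runs;
    for my $key (sort { $a <=> $b } keys %$map) {
        my $val  = $map->{$key};
        my $last = $runs[-1];
        if ($last
            && $last->[2] < 255
            && $key == $last->[0] + $last->[2]
            && $val == $last->[1] + $last->[2]) {
            $last->[2]++;                   # extend the current run
        } else {
            push @runs, [$key, $val, 1];    # start a new run
        }
    }
    return @runs;
}

# 26 sequential letters plus one isolated entry give two runs.
my %map = (map({ $_ => $_ } 0x41 .. 0x5A), 0xA4 => 0x20AC);
printf "key_start=0x%X val_start=0x%X n=%d\n", @$_ for crunch_runs(\%map);
```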

One could think of two ways to make a binary mapfile.
The first would be to write a list of all key codes
followed by a list of all value codes. The second method,
used here, appends to each partial key code list the
corresponding crunched value code lists. This keeps value
codes a little closer to their key codes.

Note: the file format is still very much in flux. Do not rely on it staying as it is, on this description being free of bugs, or on all features being implemented.

STRUCTURE:

<main>:
offset  structure  value
0x00    word       0x27b8 (magic)
0x02    @(<extended> || <submapping>)
The mapfile ends with extended mode <end> in the main stream.
<submapping>:
0x00    byte != 0  charsize1 (bits)
0x01    byte       n1, number of chars for one entry
0x02    byte       charsize2 (bits)
0x03    byte       n2, number of chars for one entry
0x04    @(<extended> || <key_seq> || <key_val_seq>)
bs1=int((charsize1+7)/8), bs2=int((charsize2+7)/8)
One submapping ends when a <mapend> entry occurs.
<key_val_seq>:
0x00    size=0|1|2|4  n, number of sequential characters
size    bs1           key1
+bs1    bs2           value1
+bs2    bs1           key2
+bs1    bs2           value2
...
key_val_seq ends if either the file ends (n = infinite
mode) or n pairs have been read.
<key_seq>:
0x00    byte  n, number of sequential characters
0x01    bs1   key_start, first character of sequence
1+bs1   @(<extended> || <val_seq>)
A key sequence starts with a byte count telling how
long the sequence is. It is followed by the key start
code. After this comes a list of value sequences. The
list of value sequences ends when sum(m) equals n.
<val_seq>:
0x00    byte  m, number of sequential characters
0x01    bs2   val_start, first character of sequence
<extended>:
0x00    byte         0
0x01    byte         ftype
0x02    byte         fsize, size of following structure
0x03    fsize bytes  something
For future extensions or private use one can insert
streams of 1..255 bytes here. ftype can have values
30..255; values 0..29 are reserved. Modes are not yet fully
defined and could change. They will be explained
later.
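
The byte-level parts of the layout above can be decoded with unpack. The following sketch covers only the four header bytes of a <submapping> and a single <extended> block, because those consist of single bytes. It leaves out the magic word, whose byte order is not stated here. All function names are hypothetical and this is not the module's own reader.

```perl
use strict;
use warnings;

# Bytes needed for a character of the given bit width: int((bits+7)/8).
sub byte_size {
    my ($bits) = @_;
    return int(($bits + 7) / 8);
}

# Decode the four header bytes of a <submapping>.
sub parse_submapping_header {
    my ($data) = @_;
    my ($charsize1, $n1, $charsize2, $n2) = unpack("C4", $data);
    die "truncated submapping header\n" unless defined $n2;
    die "charsize1 must be non-zero\n"  unless $charsize1;
    return {
        charsize1 => $charsize1, n1 => $n1,
        charsize2 => $charsize2, n2 => $n2,
        bs1 => byte_size($charsize1),
        bs2 => byte_size($charsize2),
    };
}

# Decode one <extended> block: a zero byte, ftype, fsize, then fsize bytes.
sub parse_extended {
    my ($data) = @_;
    my ($zero, $ftype, $fsize) = unpack("C3", $data);
    die "not an extended block\n" unless defined $fsize && $zero == 0;
    my $payload = substr($data, 3, $fsize);
    die "truncated extended block\n" if length($payload) < $fsize;
    return { ftype => $ftype, payload => $payload, length => 3 + $fsize };
}

my $hdr = parse_submapping_header(pack("C4", 8, 1, 16, 1));
print "bs1=$hdr->{bs1} bs2=$hdr->{bs2}\n";
```

The returned length field tells a caller how many bytes to skip before the next entry in the stream.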

TO BE DONE

- Something clever, when a character has no translation.

- Direct charset -> charset mapping.

- Better performance.

- Support for mappings according to RFC 1345.

SEE ALSO

- File "REGISTRY" and binary mappings in directory
"Unicode/Map" of your perl library path
- recode(1), map(1), mkmapfile(1), Unicode::Map(3),
Unicode::Map8(3), Unicode::String(3), Unicode::CharName(3), mirrorMappings(1)
- RFC 1345
- Mappings at Unicode consortium
ftp://ftp.unicode.org/MAPPINGS/
- Registered Internet character sets
ftp://dkuug.dk/i18n/charmaps/
- 2do: more references

AUTHOR

Martin Schwartz <martin@nacho.de>