string(3)

NAME

Unicode::String - String of Unicode characters (UCS2/UTF16)

SYNOPSIS

use Unicode::String qw(utf8 latin1 utf16);
$u = utf8("The Unicode Standard is a fixed-width, uniform ");
$u .= utf8("encoding scheme for written characters and text");
# convert to various external formats
print $u->ucs4;      # 4 byte characters
print $u->utf16;     # 2 byte characters + surrogates
print $u->utf8;      # 1-4 byte characters
print $u->utf7;      # 7-bit clean format
print $u->latin1;    # lossy
print $u->hex;       # a hexadecimal string
# all of these can be used to set the string value or as constructors
$u->latin1("Å være eller å ikke være");
$u = utf16("\0Å\0 \0v\0æ\0r\0e");
# string operations
$u2 = $u->copy;
$u->append($u2);
$u->repeat(2);
$u->chop;
$u->length;
$u->index($other);
$u->index($other, $pos);
$u->substr($offset);
$u->substr($offset, $length);
$u->substr($offset, $length, $substitute);
# overloading
$u .= "more";
$u = $u x 100;
print "$u\n";
# string <--> array of numbers
@array = $u->unpack;
$u->pack(@array);
# misc
$u->ord;
$u = uchr($num);

DESCRIPTION

A Unicode::String object represents a sequence of Unicode
characters. The Unicode Standard is a fixed-width, uniform
encoding scheme for written characters and text. This encoding
treats alphabetic characters, ideographic characters, and symbols
identically, which means that they can be used in any mixture and
with equal facility. Unicode is modeled on the ASCII character
set, but uses a 16-bit encoding to support full multilingual text.

Internally a Unicode::String object is a string of 2 byte values
in network byte order (big-endian). The class provides various
methods to convert from and to various external formats, and all
string manipulations are made on strings in this internal 16-bit
format.

The functions utf16(), utf8(), utf7(), ucs2(), ucs4(), latin1()
and uchr() can be imported from the Unicode::String module and
work as constructors, initializing strings from the corresponding
encoding. ucs2() and utf16() are aliases for the same function.

The Unicode::String objects overload various operators, so they will normally work like plain 8-bit strings in Perl.
This includes conversions to strings, numbers and booleans
as well as assignment, concatenation and repetition.

METHODS

The following methods are available:

Unicode::String->stringify_as( [$enc] )
This class method specifies which encoding is used when
Unicode::String objects are implicitly converted to and from
plain strings. It also defines which encoding to assume for the
argument of the Unicode::String constructor new(). Without an
encoding argument, stringify_as() returns the current encoding
constructor function. The encoding argument ($enc) is a string
with one of the following values: "ucs4", "ucs2", "utf16",
"utf8", "utf7", "latin1", "hex". The default is "utf8".
$us = Unicode::String->new( [$initial_value] )
This is the customary object constructor. Without an argument,
it creates an empty Unicode::String object. If an $initial_value
argument is given, it is decoded according to the encoding
selected with stringify_as() and used to initialize the newly
created object.
Normally you create Unicode::String objects by importing some of
the encoding methods below as functions into your namespace and
calling them with an appropriately encoded argument.
$us->ucs4( [$newval] )
The UCS-4 encoding uses 32 bits per character. The main benefit
of this encoding is that you don't have to deal with surrogate
pairs. Encoded as a Perl string, each character takes 4 bytes in
network byte order.
The ucs4() method always returns the old value of $us and, if
given an argument, decodes the UCS-4 string and sets this as the
new value of $us. The characters in $newval must be in the range
0x0 .. 0x10FFFF. Characters outside this range are ignored.
$us->ucs2( [$newval] )
$us->utf16( [$newval] )
ucs2() and utf16() are just different names for the same method.
The UCS-2 encoding uses 16 bits per character. The UTF-16
encoding is identical to UCS-2, but includes the use of surrogate
pairs. Surrogates make it possible to encode characters in the
range 0x010000 .. 0x10FFFF with two consecutive 16-bit chars.
Encoded as a Perl string, each character (or surrogate code)
takes 2 bytes in network byte order.
The ucs2() method always returns the old value of $us and, if
given an argument, sets this as the new value of $us.
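The surrogate-pair scheme described above can be sketched in plain
Perl. This is only an illustration of the arithmetic: the helper
name to_surrogates() is hypothetical and is not part of
Unicode::String, which performs this conversion internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::String): split a code point
# in the range 0x010000 .. 0x10FFFF into a high and a low surrogate.
sub to_surrogates {
    my ($cp) = @_;
    die "code point out of surrogate range\n"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v    = $cp - 0x10000;           # remaining 20-bit value
    my $high = 0xD800 + ($v >> 10);     # upper 10 bits
    my $low  = 0xDC00 + ($v & 0x3FF);   # lower 10 bits
    return ($high, $low);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+%04X U+%04X\n", $hi, $lo;     # prints U+D83D U+DE00
```

Each surrogate fits in one 16-bit unit, which is why such a
character occupies two positions in the internal representation.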
$us->utf8( [$newval] )
The UTF-8 encoding uses 8 bits for characters in the range
0x0 .. 0x7F, 16 bits for characters in the range 0x80 .. 0x7FF,
24 bits for characters in the range 0x800 .. 0xFFFF and 32 bits
for characters in the range 0x010000 .. 0x10FFFF. Americans like
this encoding, because plain US-ASCII characters are still
US-ASCII. Another benefit is that the character '\0' only occurs
as the encoding of 0x0, thus the normal NUL-terminated strings
(popular in the C programming language) can still be used.
The utf8() method always returns the old value of $us encoded
using UTF-8 and, if given an argument, decodes the UTF-8 string
and sets this as the new value of $us.
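The length ranges given for UTF-8 can be written down as a small
lookup. This is a minimal sketch in plain Perl; utf8_length() is a
hypothetical helper used for illustration and is not provided by
Unicode::String.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::String): number of bytes
# UTF-8 needs for a given code point, following the ranges above.
sub utf8_length {
    my ($cp) = @_;
    return 1 if $cp <= 0x7F;
    return 2 if $cp <= 0x7FF;
    return 3 if $cp <= 0xFFFF;
    return 4 if $cp <= 0x10FFFF;
    die sprintf("U+%X is outside the Unicode range\n", $cp);
}

for my $cp (0x41, 0xE5, 0x20AC, 0x1F600) {
    printf "U+%04X -> %d byte(s)\n", $cp, utf8_length($cp);
}
```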
$us->utf7( [$newval] )
The UTF-7 encoding only uses plain US-ASCII characters for the
encoding. This makes it safe for transport through 8-bit
stripping protocols. Characters outside the US-ASCII range are
base64-encoded and '+' is used as an escape character. The UTF-7
encoding is described in RFC1642.
The utf7() method always returns the old value of $us encoded
using UTF-7 and, if given an argument, decodes the UTF-7 string
and sets this as the new value of $us.
If the (global) variable
$Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a
wider range of characters are encoded as themselves. It is TRUE
by default. The characters affected by this are:

! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
$us->latin1( [$newval] )
The first 256 codes of Unicode are identical to the ISO-8859-1
8-bit encoding, also known as Latin-1. The latin1() method always
returns the old value of $us and, if given an argument, sets this
as the new value of $us. Characters outside the 0x0 .. 0xFF range
are ignored when returning a Latin-1 string. If you want more
control over the mapping from Unicode to Latin-1, use the
Unicode::Map8 class. This is also the way to deal with other
8-bit character sets.
$us->hex( [$newval] )
This method returns a plain ASCII string where each Unicode
character is represented as a "U+XXXX" string, with the
characters separated by a single space. This format can also be
used to set the value of $us (in which case the "U+" is optional).
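The "U+XXXX" text format is simple enough to sketch in plain Perl.
The helpers codes_to_hex() and hex_to_codes() below are
hypothetical and not part of Unicode::String; the lower-case hex
digits are also an assumption of this sketch, not a documented
property of the hex() method.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Unicode::String).

# Format a list of 16-bit character codes as "U+XXXX U+XXXX ...".
sub codes_to_hex {
    return join " ", map { sprintf "U+%04x", $_ } @_;
}

# Parse the same format back into numbers; the "U+" prefix is optional.
sub hex_to_codes {
    my ($text) = @_;
    return map { hex($_) } $text =~ /(?:U\+)?([0-9A-Fa-f]+)/g;
}

print codes_to_hex(0x48, 0x69), "\n";          # prints U+0048 U+0069
print join(",", hex_to_codes("U+0048 0069")), "\n";   # prints 72,105
```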
$us->as_string;
Converts a Unicode::String to a plain string according to the
setting of stringify_as(). The default stringify_as() encoding is
"utf8".
$us->as_num;
Converts a Unicode::String to a number. Currently only the digits in the range 0x30 .. 0x39 are
recognized. The plan is to eventually support all
Unicode digit characters.
$us->as_bool;
Converts a Unicode::String to a boolean value. Only the empty
string is FALSE. A string consisting of only the character U+0030
is considered TRUE, even though Perl considers "0" to be FALSE.
$us->repeat( $count );
Returns a new Unicode::String where the content of $us is repeated $count times. This operation is also
overloaded as:

$us x $count
$us->concat( $other_string );
Concatenates the string $us and the string $other_string. If
$other_string is not a Unicode::String object, then it is first
passed to the Unicode::String->new constructor function. This
operation is also overloaded as:

$us . $other_string
$us->append( $other_string );
Appends the string $other_string to the value of $us. If
$other_string is not a Unicode::String object, then it is first
passed to the Unicode::String->new constructor function. This
operation is also overloaded as:

$us .= $other_string
$us->copy;
Returns a copy of the current Unicode::String object. This
operation is overloaded as the assignment operator.
$us->length;
Returns the length of the Unicode::String. Surrogate pairs are still counted as 2.
$us->byteswap;
This method will swap the bytes in the internal representation
of the Unicode::String object.
Unicode reserves the character U+FEFF as a byte order mark. This
works because the swapped character, U+FFFE, is reserved and
never valid. For strings that have the byte order mark as the
first character, we can guarantee to get the byte order right
with the following code:

$ustr->byteswap if $ustr->ord == 0xFFFE;
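The same byte order check can be illustrated on a raw byte string
using only core Perl. This is a sketch of the idea, not the
module's implementation; swap_pairs() is a hypothetical helper
name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::String): swap the two bytes
# of every 16-bit unit in a byte string.
sub swap_pairs {
    my ($bytes) = @_;
    return pack("v*", unpack("n*", $bytes));
}

# Byte order mark, "A" and U+00E5, stored in little-endian order.
my $data = "\xFF\xFE\x41\x00\xE5\x00";

# Read the first 16-bit unit as big-endian; 0xFFFE means the data is swapped.
my $first = unpack("n", $data);
$data = swap_pairs($data) if $first == 0xFFFE;

print unpack("H*", $data), "\n";    # prints feff004100e5
```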
$us->unpack;
Returns a list of integers, each representing a UTF-16 character
code.
$us->pack( @uchr );
Sets the value of $us to a sequence of UTF-16 characters with the
character codes given as parameters.
$us->ord;
Returns the character code of the first character in
$us. The ord() method deals with surrogate pairs,
which gives us a result-range of 0x0 .. 0x10FFFF. If
the $us string is empty, undef is returned.
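How a surrogate pair combines into a single code in the range
0x0 .. 0x10FFFF can be sketched as follows. The helper
first_code_point() is hypothetical and works on a plain list of
16-bit codes; returning an unpaired surrogate unchanged is an
assumption of this sketch, not documented behaviour of ord().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::String): return the code of
# the first character in a list of 16-bit units, or undef if it is empty.
sub first_code_point {
    my @units = @_;
    return undef unless @units;
    my $hi = $units[0];
    # A high surrogate followed by a low surrogate forms one character.
    if ($hi >= 0xD800 && $hi <= 0xDBFF && @units > 1) {
        my $lo = $units[1];
        if ($lo >= 0xDC00 && $lo <= 0xDFFF) {
            return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
        }
    }
    return $hi;    # ordinary 16-bit character (or unpaired surrogate)
}

printf "U+%X\n", first_code_point(0xD83D, 0xDE00);    # prints U+1F600
```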
$us->chr( $code );
Sets the value of $us to be a string containing the character
with code $code. The argument $code must be an integer in the
range 0x0 .. 0x10FFFF. If the code is greater than 0xFFFF then a
surrogate pair is created.
$us->name
In scalar context returns the official Unicode name of the first
character in $us. In array context returns the names of all
characters in $us. Also see Unicode::CharName.
$us->substr( $offset, [$length, [$subst]] )
Returns a sub-string of $us. Works similarly to the builtin
substr function, but because we can't make LVALUE subs yet, you
have to pass the string you want to assign to the sub-string as
the 3rd parameter.
$us->index( $other, [$pos] );
Locates the position of $other within $us, possibly
starting the search at position $pos.
$us->chop;
Chops off the last character of $us and returns it (as
a Unicode::String object).

FUNCTIONS

The following utility functions are provided. They will
be exported on request.

byteswap2($str, ...)
This function swaps each pair of bytes in the strings passed as
arguments. This can be used to fix up UTF-16 or UCS-2 strings
from little-endian systems. If this function is called in void
context, it modifies its arguments in-place. Otherwise, the
swapped strings are returned.
byteswap4($str, ...)
The byteswap4 function works similarly to byteswap2, but reverses
the order of the bytes in each group of 4. It can be used to fix
little-endian UCS-4 strings.

SEE ALSO

Unicode::CharName, Unicode::Map8, http://www.unicode.org/

COPYRIGHT

Copyright 1997-2000 Gisle Aas.

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.