-rw-r--r-- | pod/perli18n.pod | 180 |
1 files changed, 112 insertions, 68 deletions
diff --git a/pod/perli18n.pod b/pod/perli18n.pod
index 891f95ef48..aea6b4ac57 100644
--- a/pod/perli18n.pod
+++ b/pod/perli18n.pod
@@ -5,10 +5,10 @@ perl18n - Perl i18n (internalization)
 =head1 DESCRIPTION
 
 Perl supports the language-specific notions of data like
-"is this a letter" and "which letter comes first". These
+"is this a letter" and "which letter comes first". These
 are very important issues especially for languages other
 than English -- but also for English: it would be very
-naïve indeed to think that C<A-Za-z> defines all the letters.
+naïve indeed to think that C<A-Za-z> defines all the "letters".
 
 Perl understands the language-specific data via the standardized
 (ISO C, XPG4, POSIX 1.c) method called "the locale system".
@@ -33,26 +33,27 @@ In runtime you can switch locales using the POSIX::setlocale().
     $old_locale = setlocale(LC_CTYPE);
 
     setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
-    # for LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"
+    # LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"
 
     setlocale(LC_CTYPE, "");
-    # for LC_CTYPE now in locale what the LC_ALL / LC_CTYPE / LANG define.
+    # LC_CTYPE now in locale what the LC_ALL / LC_CTYPE / LANG define.
     # see below for documentation about the LC_ALL / LC_CTYPE / LANG.
 
     # restore the old locale
     setlocale(LC_CTYPE, $old_locale);
 
 The first argument of C<setlocale()> is called B<the category> and the
-second argument B<the locale>. The category tells in what aspect of data
-processing we want to apply language-specific rules, the locale tells
-in what language-country/territory-codeset - but read on for the naming
-of the locales: not all systems name locales as in the example.
+second argument B<the locale>. The category tells in what aspect of
+data processing we want to apply language-specific rules, the locale
+tells in what language-country/territory-codeset - but read on for the
+naming of the locales: not all systems name locales as in the example.
 
 For further information about the categories, please consult your
-L<setlocale(3)> manual. For the locales available in your system, also
-consult the L<setlocale(3)> manual and see whether it leads you to the
-list of the available locales (search for the C<SEE ALSO> section). If
-that fails, try out in command line the following commands:
+L<setlocale(3)> manual. For the locales available in your system,
+also consult the L<setlocale(3)> manual and see whether it leads you
+to the list of the available locales (search for the C<SEE ALSO>
+section). If that fails, try out in command line the following
+commands:
 
 =over 12
 
@@ -76,60 +77,101 @@ and see whether they list something resembling these
     english             german              russian
     english.iso88591    german.iso88591     russian.iso88595
 
-Sadly enough even if the calling interface has been standardized
-the names of the locales are not. The naming usually is
-language-country/territory-codeset but the latter parts may
-not be present. Two special locales are worth special mention:
-
-    "C"
-
-and
-    "POSIX"
+Sadly enough even if the calling interface has been standardized the
+names of the locales are not. The naming usually is
+language_country/territory.codeset but the latter parts may not be
+present.
 
+Two special locales are worth special mention: C<"C"> and C<"POSIX">.
 Currently and effectively these are the same locale: the difference is
 mainly that the first one is defined by the C standard and the second
-one is defined by the POSIX standard. What they mean and define is the
-B<default locale> in which every program does start in. The language
-is (American) English and the character codeset C<ASCII>.
-B<NOTE>: not all systems have the C<"POSIX"> locale (not all systems
-are POSIX): use the C<"C"> locale when you need the default locale.
+one is defined by the POSIX standard. What they mean and define is
+the B<default locale> in which every program does start in. The
+language is (American) English and the character codeset C<ASCII>.
+B<NOTE>: Not all systems have the C<"POSIX"> locale (not all systems
+are POSIX), so use the C<"C"> locale when you need the default locale.
 
-=head2 Category LC_CTYPE: CHARACTER TYPES
+=head2 The C<use locale> Pragma
 
-Starting from Perl version 5.002 perl has obeyed the C<LC_CTYPE>
-environment variable which controls application's notions on
-which characters are alphabetic characters. This affects in
-Perl the regular expression metanotation
+By default, Perl ignores the current locale. The C<use locale> pragma
+tells Perl to use the current locale for some operations: The
+comparison functions (lt, le, eq, cmp, ne, ge, gt, sort) use
+C<LC_COLLATE>; regular expressions and case-modification functions
+(uc, lc, ucfirst, lcfirst) use C<LC_CTYPE>; and formatting functions
+(printf and sprintf) use C<LC_NUMERIC>. The default behavior returns
+with C<no locale> or by reaching the end of the enclosing block.
 
-    \w
+Note that the result of any operation that uses locale information is
+tainted, since locales can be created by unprivileged users on some
+systems (see L<perlsec.pod>).
 
-which stands for alphanumeric characters, that is, alphabetic and
-numeric characters (please consult L<perlre> for more information
-about regular expressions). Thanks to the C<LC_CTYPE>, depending on
-your locale settings, characters like C<Æ>, C<É>, C<ß>, C<ø>, can be
-understood as C<\w> characters.
+=head2 Category LC_COLLATE: Collation
 
-=head2 Category LC_COLLATE: COLLATION
-
-Starting from Perl version 5.003_06 perl has obeyed the B<LC_COLLATE>
+When in the scope of C<use locale>, Perl obeys the B<LC_COLLATE>
 environment variable which controls application's notions on the
-collation (ordering) of the characters. C<B> does in most Latin
+collation (ordering) of the characters. C<B> does in most Latin
 alphabets follow the C<A> but where do the C<Á> and C<Ä> belong?
 
+B<NOTE>: Comparing and sorting by locale is usually slower than the
+default sorting; factors of 2 to 4 have been observed. It will also
+consume more memory: while a Perl scalar variable is participating in
+any string comparison or sorting operation and obeying the locale
+collation rules it will take about 3-15 (the exact value depends on
+the operating system) times more memory than normally. These downsides
+are dictated more by the operating system implementation of the locale
+system than by Perl.
+
 Here is a code snippet that will tell you what are the alphanumeric
 characters in the current locale, in the locale order:
 
-    perl -le 'print sort grep /\w/, map { chr() } 0..255'
+    use POSIX qw(setlocale LC_COLLATE);
+    use locale;
 
-As noted above, this will work only for Perl versions 5.003_06 and up.
+    setlocale(LC_COLLATE, "");
+    print +(sort grep /\w/, map { chr() } 0..255), "\n";
 
-B<NOTE>: in the pre-5.003_06 Perl releases the per-locale collation
-was possible using the C<I18N::Collate> library module. This is now
-mildly obsolete and to be avoided. The C<LC_COLLATE> functionality is
+The default collation must be used for example for sorting raw binary
+data whereas the locale collation is useful for natural text.
+
+B<NOTE>: In some locales some characters may have no collation value
+at all -- this means for example if the C<'-'> is such a character the
+C<relocate> and C<re-locate> may sort to the same place.
+
+B<NOTE>: For certain environments the locale support by the operating
+system is very simply broken and cannot be used or fixed by Perl. Such
+deficiencies can and will result in mysterious hangs and/or Perl core
+dumps. One such example is IRIX before the release 6.2, the
+C<LC_COLLATE> support simply does not work. When confronted with such
+systems, please report in excruciating detail to C<perlbug@perl.com>,
+complain to your vendor, maybe some bug fixes exist for your operating
+system for these problems? Sometimes such bug fixes are called an
+operating system upgrade.
+
+B<NOTE>: In the pre-5.003_06 Perl releases the per-locale collation
+was possible using the C<I18N::Collate> library module. This is now
+mildly obsolete and to be avoided. The C<LC_COLLATE> functionality is
 integrated into the Perl core language and one can use scalar data
 completely normally -- there is no need to juggle with the scalar
 references of C<I18N::Collate>.
 
+=head2 Category LC_CTYPE: Character Types
+
+When in the scope of C<use locale>, Perl obeys the C<LC_CTYPE> locale
+information which controls application's notions on which characters
+are alphabetic characters. This affects in Perl the regular expression
+metanotation C<\\w> which stands for alphanumeric characters, that is,
+alphabetic and numeric characters (please consult L<perlre> for more
+information about regular expressions). Thanks to the C<LC_CTYPE>,
+depending on your locale settings, characters like C<Æ>, C<É>,
+C<ß>, C<ø>, may be understood as C<\w> characters.
+
+=head2 Category LC_NUMERIC: Numeric Formatting
+
+When in the scope of C<use locale>, Perl obeys the C<LC_NUMERIC>
+locale information which controls application's notions on how numbers
+should be formatted for input and output. This affects in Perl the
+printf and fprintf function, as well as POSIX::strtod.
+
 =head1 ENVIRONMENT
 
 =over 12
@@ -137,16 +179,17 @@ references of C<I18N::Collate>.
 =item PERL_BADLANG
 
 A string that controls whether Perl warns in its startup about failed
-locale settings. This can happen if the locale support in the
-operating system is lacking (broken) is some way. If this string has
+locale settings. This can happen if the locale support in the
+operating system is lacking (broken) is some way. If this string has
 an integer value differing from zero, Perl will not complain.
-B<NOTE>: this is just hiding the warning message: the message tells
+
+B<NOTE>: This is just hiding the warning message. The message tells
 about some problem in your system's locale support and you should
 investigate what the problem is.
 
 =back
 
-The following environment variables are not specific to Perl: they are
+The following environment variables are not specific to Perl: They are
 part of the standardized (ISO C, XPG4, POSIX 1.c) setlocale method to
 control an application's opinion on data.
 
@@ -159,32 +202,33 @@ set, it overrides all the rest of the locale environment variables.
 
 =item LC_CTYPE
 
-C<LC_ALL> controls the classification of characters, see above.
-
-If this is unset and the C<LC_ALL> is set, the C<LC_ALL> is used as
-the C<LC_CTYPE>. If both this and the C<LC_ALL> are unset but the C<LANG>
-is set, the C<LANG> is used as the C<LC_CTYPE>.
-If none of these three is set, the default locale C<"C">
-is used as the C<LC_CTYPE>.
+In the absence of C<LC_ALL>, C<LC_CTYPE> chooses the character type
+locale. In the absence of both C<LC_ALL> and C<LC_CTYPE>, C<LANG>
+chooses the character type locale.
 
 =item LC_COLLATE
 
-C<LC_ALL> controls the collation of characters, see above.
+In the absence of C<LC_ALL>, C<LC_COLLATE> chooses the collation
+locale. In the absence of both C<LC_ALL> and C<LC_COLLATE>, C<LANG>
+chooses the collation locale.
+
+=item LC_NUMERIC
 
-If this is unset and the C<LC_ALL> is set, the C<LC_ALL> is used as
-the C<LC_CTYPE>. If both this and the C<LC_ALL> are unset but the
-C<LANG> is set, the C<LANG> is used as the C<LC_COLLATE>.
-If none of these three is set, the default locale C<"C">
-is used as the C<LC_COLLATE>.
+In the absence of C<LC_ALL>, C<LC_NUMERIC> chooses the numeric format
+locale. In the absence of both C<LC_ALL> and C<LC_NUMERIC>, C<LANG>
+chooses the numeric format.
 
 =item LANG
 
-LC_ALL is the "catch-all" locale environment variable. If it is set,
-it is used as the last resort if neither of the C<LC_ALL> and the
-category-specific C<LC_...> are set.
+C<LANG> is the "catch-all" locale environment variable. If it is set,
+it is used as the last resort after the overall C<LC_ALL> and the
+category-specific C<LC_...>.
 
 =back
 
 There are further locale-controlling environment variables
-(C<LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME>) but Perl
-B<does not> currently obey them.
+(C<LC_MESSAGES, LC_MONETARY, LC_TIME>) but Perl B<does not> currently
+use them, except possibly as they affect the behavior of library
+functions called by Perl extensions.
+
+=cut
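
The scoping behaviour that the new "use locale" section describes can be illustrated with a minimal sketch. This example is not part of the patch: the word list and the variable names (@words, @by_code, @by_locale) are hypothetical, and the locale-ordered result depends entirely on what LC_ALL / LC_COLLATE / LANG select on the machine running it, so no output is claimed for that part.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Adopt the collation locale chosen by LC_ALL / LC_COLLATE / LANG.
# setlocale() returns a false value if the request cannot be honoured.
setlocale(LC_COLLATE, "")
    or warn "could not set LC_COLLATE from the environment\n";

# Hypothetical sample data, ASCII only so the default order is predictable.
my @words = qw(zebra Apple apple Zebra);

# Outside "use locale": sort compares by native character codes,
# so uppercase ASCII letters come before lowercase ones.
my @by_code = sort @words;

my @by_locale;
{
    use locale;                  # lexically scoped; ends with this block
    @by_locale = sort @words;    # collation now follows LC_COLLATE
}

print "code order:   @by_code\n";
print "locale order: @by_locale\n";
```

Under the "C" locale the two lists should come out the same; under a natural-language locale the second list may interleave upper and lower case, which is the difference the LC_COLLATE section is describing.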