author | Jarkko Hietaniemi <jhi@iki.fi> | 2001-11-06 03:05:34 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2001-11-06 03:05:34 +0000 |
commit | a72c75842468bcd2a7cf17032844c4040a5a31e2 (patch) | |
tree | f1d67259d9b154926eb495b329d3239f96b9be7c /lib/encoding.pm | |
parent | 545666dba9cc33d16d0b8341e36facdb43c44913 (diff) | |
download | perl-a72c75842468bcd2a7cf17032844c4040a5a31e2.tar.gz |
Implement the encoding pragma for regex literals.
p4raw-id: //depot/perl@12864
Diffstat (limited to 'lib/encoding.pm')
-rw-r--r-- | lib/encoding.pm | 23 |
1 files changed, 21 insertions, 2 deletions
diff --git a/lib/encoding.pm b/lib/encoding.pm
index 6f5970f2ca..94ee3231fb 100644
--- a/lib/encoding.pm
+++ b/lib/encoding.pm
@@ -57,14 +57,33 @@ encoding pragma you can change this default.
 The pragma is a per script, not a per block lexical. Only the last
 C<use encoding> matters, and it affects B<the whole script>.
 
+Notice that only literals (string or regular expression) having only
+legacy code points are affected: if you mix data like this
+
+	\xDF\x{100}
+
+the data is assumed to be in (Latin 1 and) Unicode, not in your native
+encoding. In other words, this will match in "greek":
+
+	"\xDF" =~ /\x{3af}/
+
+but this will not
+
+	"\xDF\x{100}" =~ /\x{3af}\x{100}/
+
+since the C<\xDF> on the left will B<not> be upgraded to C<\x{3af}>
+because of the C<\x{100}> on the left. You should not be mixing your
+legacy data and Unicode in the same string.
+
 If no encoding is specified, the environment variable L<PERL_ENCODING>
 is consulted. If that fails, "latin1" (ISO 8859-1) is assumed.
 If no encoding can be found, C<Unknown encoding '...'> error will be thrown.
 
 =head1 KNOWN PROBLEMS
 
-Literals in regular expressions are not affected by this pragma.
-They very probably should.
+For native multibyte encodings (either fixed or variable length)
+the current implementation of the regular expressions may introduce
+recoding errors for longer regular expression literals than 127 bytes.
 
 =head1 SEE ALSO
 
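
The rule documented in the added POD (literals holding only legacy code points are recoded, mixed literals are left as Unicode) can be approximated without the pragma by using the Encode module directly. This is only a rough sketch of the documented behaviour, not the code in this commit: `upgrade_legacy` is a hypothetical helper, and it assumes that "greek" resolves to ISO 8859-7, where byte 0xDF maps to U+03AF.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of encoding.pm): mimics the documented rule.
# A literal made only of legacy code points (all <= 0xFF) is decoded from the
# native encoding; a literal that already contains a code point above 0xFF is
# treated as Unicode and returned unchanged.
sub upgrade_legacy {
    my ($literal, $native) = @_;
    return $literal if $literal =~ /[^\x00-\xFF]/;    # mixed data: no recoding
    return decode($native, $literal);
}

my $native = 'iso-8859-7';    # assumption: the encoding behind "greek"

# Legacy-only literal: 0xDF is upgraded to U+03AF, so the pattern can match.
my $legacy = upgrade_legacy("\xDF", $native);
print "legacy only: ", ($legacy =~ /\x{3af}/ ? "match" : "no match"), "\n";

# Mixed literal: \xDF stays U+00DF because of the \x{100}, so no match.
my $mixed = upgrade_legacy("\xDF\x{100}", $native);
print "mixed: ", ($mixed =~ /\x{3af}\x{100}/ ? "match" : "no match"), "\n";
```

The check for any code point above 0xFF stands in for the "having only legacy code points" condition in the POD; how the pragma itself makes that decision is not shown in this diff.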