PATCH: [perl #56444] delayed interpolation of \N{...}

make regen embed.fnc needs to be run on this patch.

This patch fixes Bugs #56444 and #62056. Hopefully we have finally gotten this right.

The parser used to handle all the escaped constants, expanding \x2e to its single byte equivalent. The problem is that for regexp patterns, this is a '.', which is a metacharacter and has special meaning that \x2e does not. So things were changed so that the parser didn't expand things in patterns. But this causes problems for \N{NAME} when the pattern doesn't get evaluated until runtime, as for example when it has a scalar reference in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was in effect at the point during the parsing phase at which this regex was encountered, but we don't actually look at it until runtime, when these bug reports show that it is gone.

The solution is for the tokenizer to parse \N{NAME}, but to compile it into an intermediate value that won't ever be considered a metacharacter. We have chosen to compile NAME to its equivalent code point value, and express it in the already existing \N{U+...} form. This indicates to the regex compiler that the original input was a named character, and retains the value it had at that point in the parse. This means that \N{U+...} now always must imply Unicode semantics for the string or pattern it appeared in. Previously there was an inconsistency, where effectively \N{NAME} implied Unicode semantics, but \N{U+...} did not necessarily. So now, any string or pattern that has either of these forms is utf8 upgraded.

A complication is that a charnames handler can return a sequence of multiple characters instead of just one. To deal with this case, the tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where c1 etc. are the individual characters. Perhaps this will be made a public interface someday, but I decided to not expose it externally as far as possible for now in case we find reason to change it. It is possible to defeat this by passing it in a single quoted string to the regex compiler, so the documentation will be changed to discourage that.

A further complication is that \N can have an additional meaning: to match a non-newline. This means that the two meanings have to be disambiguated.

embed.fnc was changed to make public the function regcurly() in regcomp.c so that it could be referred to in toke.c to see if the ... in \N{...} is a legal quantifier like {2,}. This is used in the disambiguation.

toke.c was changed to update some out-dated relevant comments. It now parses \N in patterns. If it determines that it isn't a named sequence, it passes it through unchanged. This happens when there is no brace after the \N, or no closing brace, or if the braces enclose a legal quantifier. Previously there has been essentially no restriction on what can come between the braces, so that a custom translator can accept virtually anything. Now, legal quantifiers are assumed to mean that the \N is a "match non-newline that quantity of times". I removed the #ifdef'd out code that had been left in in case pack U reverted to earlier behavior. I did this because it complicated things, and because the change to pack U has been in long enough and shown that it is correct, so it's not likely to be reverted. \N meaning a named character is handled differently depending on whether this is a pattern or not. In all cases, the output will be upgraded to utf8 because a named character implies Unicode semantics. If not a pattern, the \N is parsed into a utf8 string, as before. Otherwise it will be parsed into the intermediate \N{U+...} form. If the original was already a valid \N{U+...} constant, it is passed through unchanged. I now check that the sequence returned by the charnames handler is not malformed, which was lacking before.

The code in regcomp.c which dealt with interfacing with the charnames handler has been removed. All the values should be determined by the time regcomp.c gets involved. The affected subroutine is necessarily restructured. An EXACT-type node is generated for the character sequence. Such a node has a capacity of 255 bytes, and so it is possible to overflow it. This wasn't checked for before, but now it is; a warning is issued and the overflowing characters are discarded.
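
The disambiguation and rewriting rules described in the commit message can be illustrated with a small sketch. This is not the code in toke.c or regcomp.c; it is a language-neutral model written in Python, and the names `NAMES`, `QUANTIFIER`, `U_PLUS` and `rewrite_named_chars` are hypothetical, as is the two-character entry in the name table. It assumes a quantifier has the shape `{n}`, `{n,}` or `{n,m}`, which is roughly what a regcurly()-style check accepts.

```python
import re

# Hypothetical stand-in for a charnames handler: name -> list of code points.
# The second entry is invented to show the multi-character case.
NAMES = {
    "THAI CHARACTER SARA I": [0x0E34],
    "HYPOTHETICAL TWO CHAR SEQUENCE": [0x41, 0x301],
}

# Assumed quantifier shape: {n}, {n,} or {n,m}, decimal digits only.
QUANTIFIER = re.compile(r"\{\d+(?:,\d*)?\}")
# Body of an already-resolved constant, e.g. U+E34 or U+41.301.
U_PLUS = re.compile(r"U\+[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]+)*")


def rewrite_named_chars(pattern, names=NAMES):
    """Rewrite \\N{NAME} in a pattern to the intermediate \\N{U+...} form."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\" or i + 1 >= len(pattern):
            out.append(ch)
            i += 1
            continue
        if pattern[i + 1] != "N":
            # Any other escape (including an escaped backslash) is copied as is.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        brace = i + 2
        has_brace = pattern[brace:brace + 1] == "{"
        close = pattern.find("}", brace) if has_brace else -1
        # No brace, no closing brace, or a legal quantifier:
        # treat \N as "match a non-newline" and pass it through unchanged.
        if close == -1 or QUANTIFIER.fullmatch(pattern[brace:close + 1]):
            out.append("\\N")
            i += 2
            continue
        body = pattern[brace + 1:close]
        if U_PLUS.fullmatch(body):
            # Already in \N{U+...} form: pass through unchanged.
            out.append(pattern[i:close + 1])
        else:
            if body not in names:
                raise ValueError("unknown character name: " + body)
            # Multiple code points are joined with dots: \N{U+c1.c2...}
            hexes = ".".join(format(cp, "X") for cp in names[body])
            out.append("\\N{U+" + hexes + "}")
        i = close + 1
    return "".join(out)
```

With this model, `rewrite_named_chars(r"\N{THAI CHARACTER SARA I}+")` yields `\N{U+E34}+`, while `\N{3}` and a bare `\N` are returned untouched. Because the name is resolved when the pattern text is first seen, a later runtime compilation no longer depends on the charnames handler still being in scope.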
author: Karl Williamson <khw@khw-desktop.(none)> 2010-02-18 13:41:09 -0700
committer: Rafael Garcia-Suarez <rgs@consttype.org> 2010-02-19 10:10:45 +0100
commit: ff3f963aa0f95ea53996b6a3842b824504b57c79 (patch)
tree: 4abf450af4da96f0807bcb677721826f3035e7eb /pod/perl5120delta.pod
parent: 8df7d2a3b2f29aff6003c365b42d16552e518406 (diff)
download: perl-ff3f963aa0f95ea53996b6a3842b824504b57c79.tar.gz
1 files changed, 3 insertions, 20 deletions
diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod
index aebeedfddb..47304ff55f 100644
--- a/pod/perl5120delta.pod
+++ b/pod/perl5120delta.pod
@@ -237,9 +237,10 @@ for some or all operations. (Yuval Kogman)
 
 A new regex escape has been added, C<\N>. It will match any character that
 is not a newline, independently from the presence or absence of the single
-line match modifier C</s>. (If C<\N> is followed by an opening brace and
+line match modifier C</s>.  It is not usable within a character class.
+(If C<\N> is followed by an opening brace and
 by a letter, perl will still assume that a Unicode character name is
-coming, so compatibility is preserved.) (Rafael Garcia-Suarez)
+coming, so compatibility is preserved.) (Rafael Garcia-Suarez).
 
 This will break a L<custom charnames translator|charnames/CUSTOM TRANSLATORS>
 which allows numbers for character names, as C<\N{3}> will now mean to match 3
@@ -2464,24 +2465,6 @@ take a block as their first argument, like
 
 =item *
 
-The C<charnames> pragma may generate a run-time error when a regex is
-interpolated [RT #56444]:
-
-    use charnames ':full';
-    my $r1 = qr/\N{THAI CHARACTER SARA I}/;
-    "foo" =~ $r1;    # okay
-    "foo" =~ /$r1+/; # runtime error
-
-A workaround is to generate the character outside of the regex:
-
-    my $a = "\N{THAI CHARACTER SARA I}";
-    my $r1 = qr/$a/;
-
-However, C<$r1> must be used within the scope of the C<use charnames> for this
-to work.
-
-=item *
-
 Some regexes may run much more slowly when run in a child thread compared
 with the thread the pattern was compiled into [RT #55600].