Add \o{} escape

This commit adds the new construct \o{} to express a character constant by its octal ordinal value, along with ancillary tests and documentation. A function to handle this is added to util.c, and it is called from the three places in the parser where the construct can occur. The function is a candidate for inlining, though I doubt that it will ever be used frequently.
author: Karl Williamson <khw@khw-desktop.(none)> 2010-07-15 17:28:28 -0600
committer: David Golden <dagolden@cpan.org> 2010-07-17 21:50:48 -0400
commit: f0a2b745ce6c03aec6412d79ce0b782f20eddce4 (patch)
tree: d1786b1a4a80f6b848dca1ab4eba6e3ffd5dc5d1 /pod/perlrebackslash.pod
parent: 8e4698ef1ed0da722532bfcc769ba22fe85c4b47 (diff)
download: perl-f0a2b745ce6c03aec6412d79ce0b782f20eddce4.tar.gz
1 files changed, 57 insertions, 26 deletions
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 9d246bdc2e..d460f7f052 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -62,7 +62,7 @@ quoted constructs>.
 Those not usable within a bracketed character class (like C<[\da-z]>) are marked
 as C<Not in [].>
 
- \000              Octal escape sequence.
+ \000              Octal escape sequence.  See also \o{}.
  \1                Absolute backreference.  Not in [].
  \a                Alarm or bell.
  \A                Beginning of string.  Not in [].
@@ -86,6 +86,7 @@ as C<Not in [].>
  \n                (Logical) newline character.
  \N                Any character but newline.  Experimental.  Not in [].
  \N{}              Named or numbered (Unicode) character.
+ \o{}              Octal escape sequence.
  \p{}, \pP         Character with the given Unicode property.
  \P{}, \PP         Character without the given Unicode property.
  \Q                Quotemeta till \E.  Not in [].
@@ -207,33 +208,57 @@ match "as is".
 
 =head3 Octal escapes
 
-Octal escapes consist of a backslash followed by three octal digits
-matching the code point of the character you want to use.  (In some contexts,
-two or even one octal digits are also accepted, sometimes with a warning.) This
-allows for 512 characters (C<\000> up to C<\777>) that can be expressed this
-way.  Enough in pre-Unicode days,
-but most Unicode characters cannot be escaped this way.
+There are two forms of octal escapes.  Each is used to specify a character by
+its ordinal, specified in octal notation.
+
+One form, available starting in Perl 5.14, looks like C<\o{...}>, where the dots
+represent one or more octal digits.  It can be used for any Unicode character.
+
+It was introduced to avoid the potential problems with the other form,
+available in all Perls.  That form consists of a backslash followed by three
+octal digits.  One problem with this form is that it can look exactly like an
+old-style backreference (see
+L</Disambiguation rules between old-style octal escapes and backreferences>
+below.)  You can avoid this by making the first of the three digits always a
+zero, but that makes \077 the largest ordinal unambiguously specifiable by this
+form.
+
+In some contexts, a backslash followed by two or even one octal digits may be
+interpreted as an octal escape, sometimes with a warning, and because of some
+bugs, sometimes with surprising results.  Also, if you are creating a regex
+out of smaller snippets concatenated together, and you use fewer than three
+digits, the beginning of one snippet may be interpreted as adding digits to the
+ending of the snippet before it.  See L</Absolute referencing> for more
+discussion and examples of the snippet problem.
 
 Note that a character that is expressed as an octal escape is considered
 as a character without special meaning by the regex engine, and will match
 "as is".
 
-=head4 Examples (assuming an ASCII platform)
+To summarize, the C<\o{}> form is always safe to use, and the other form is
+safe to use for ordinals up through \077 when you use exactly three digits to
+specify them.
 
- $str = "Perl";
- $str =~ /\120/;    # Match, "\120" is "P".
- $str =~ /\120+/;   # Match, "\120" is "P", it is repeated at least once
- $str =~ /P\053/;   # No match, "\053" is "+" and taken literally.
+Mnemonic: I<0>ctal or I<o>ctal.
 
-=head4 Caveat
+=head4 Examples (assuming an ASCII platform)
 
-Octal escapes potentially clash with old-style backreferences (see L</Absolute
-referencing> below). They both consist of a backslash followed by numbers. So
-Perl has to use heuristics to determine whether it is a backreference or an
-octal escape. You can avoid ambiguity by using the C<\g> form for
-backreferences, and by beginning octal escapes with a "0".  (Since octal
-escapes are 3 digits, this latter method works only up to C<\077>.)  In the
-absence of C<\g>, Perl uses the following rules:
+ $str = "Perl";
+ $str =~ /\o{120}/;  # Match, "\120" is "P".
+ $str =~ /\120/;     # Same.
+ $str =~ /\o{120}+/; # Match, "\120" is "P", it's repeated at least once
+ $str =~ /\120+/;    # Same.
+ $str =~ /P\053/;    # No match, "\053" is "+" and taken literally.
+ /\o{23073}/         # Black foreground, white background smiling face.
+ /\o{4801234567}/    # Raises a warning, and yields chr(4)
+
+=head4 Disambiguation rules between old-style octal escapes and backreferences
+
+Octal escapes of the C<\000> form outside of bracketed character classes
+potentially clash with old-style backreferences (see L</Absolute referencing>
+below).  They both consist of a backslash followed by numbers.  So Perl has to
+use heuristics to determine whether it is a backreference or an octal escape.
+Perl uses the following rules to disambiguate:
 
 =over 4
 
@@ -258,18 +283,24 @@ the rest are matched as is.
     $pat .= ")" x 999;
  /^($pat)\1000$/;   #  Matches 'aa'; there are 1000 capture groups.
  /^$pat\1000$/;     #  Matches 'a@0'; there are 999 capture groups
-                    #    and \1000 is seen as \100 (a '@') and a '0'.
+                    #    and \1000 is seen as \100 (a '@') and a '0'
 
 =back
 
+You can always force a backreference interpretation by using the C<\g{...}>
+form.  You can always force an octal interpretation by using the C<\o{...}>
+form, or for numbers up through \077 (= 63 decimal), by using three digits,
+beginning with a "0".
+
 =head3 Hexadecimal escapes
 
-Hexadecimal escapes start with C<\x> and are then either followed by a
-two digit hexadecimal number, or a hexadecimal number of arbitrary length
-surrounded by curly braces. The hexadecimal number is the code point of
-the character you want to express.
+Like octal escapes, there are two forms of hexadecimal escapes, but both start
+with the same thing, C<\x>.  This is followed by either exactly two hexadecimal
+digits forming a number, or a hexadecimal number of arbitrary length surrounded
+by curly braces. The hexadecimal number is the code point of the character you
+want to express.
 
-Note that a character that is expressed as a hexadecimal escape is considered
+Note that a character that is expressed as one of these escapes is considered
 as a character without special meaning by the regex engine, and will match
 "as is".
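
The behaviour documented in this patch can be checked with a short script. The following is only a sketch, assuming Perl 5.14 or later and an ASCII platform; the variable names are illustrative and do not come from the patch or its tests. It compares the two octal forms and reproduces the snippet-concatenation problem described in the new text.

```perl
use strict;
use warnings;

my $str = "Perl";

# Both octal forms name chr(0120), which is "P" on an ASCII platform.
my $brace_form  = ($str =~ /\o{120}/) ? 1 : 0;
my $legacy_form = ($str =~ /\120/)    ? 1 : 0;

# \053 is a literal "+", not a quantifier, so the pattern looks for "P+".
my $literal_plus = ($str =~ /P\053/) ? 1 : 0;

# Snippet problem: two fragments are joined into a single pattern.
# The two-digit legacy escape \12 absorbs the following "3" and is
# compiled as \123 ("S"); the braced form keeps its digits delimited.
my ($head_legacy, $head_brace, $tail) = ('\12', '\o{12}', '3');
my $joined_legacy = qr/\A$head_legacy$tail\z/;
my $joined_brace  = qr/\A$head_brace$tail\z/;

# \g{1} is always a backreference, never an octal escape.
my $backref = ("aa" =~ /\A(a)\g{1}\z/) ? 1 : 0;

printf "brace form matches 'P':          %d\n", $brace_form;
printf "legacy form matches 'P':         %d\n", $legacy_form;
printf "P\\053 matches 'Perl':            %d\n", $literal_plus;
printf "legacy snippets match 'S':       %d\n", ("S"   =~ $joined_legacy) ? 1 : 0;
printf "legacy snippets match \"\\n3\":    %d\n", ("\n3" =~ $joined_legacy) ? 1 : 0;
printf "braced snippets match \"\\n3\":    %d\n", ("\n3" =~ $joined_brace)  ? 1 : 0;
printf "\\g{1} backreference matches 'aa': %d\n", $backref;
```

The braced form is used for the concatenation case because its closing brace ends the escape regardless of what the next snippet begins with.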