author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2011-11-14 11:41:03 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2011-11-14 11:41:03 +0000
commit cc87f05b24e3944bf2bb4416dfc704593ef0a6f5
tree c45329e620cddba38432e21350303089967ff6e4
parent fce480ed2031901b511711ff50ca67afe06080f0
Small tidies, and documentation update for JavaScript \x, \u, \U support.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@745 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--  ChangeLog          2
-rw-r--r--  configure.ac       6
-rw-r--r--  doc/pcreapi.3     16
-rw-r--r--  doc/pcrecompat.3   5
-rw-r--r--  doc/pcrepattern.3 44
-rw-r--r--  pcre_compile.c     2
6 files changed, 54 insertions, 21 deletions
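Note: the documentation hunks below describe how \x, \u and \U are read when PCRE_JAVASCRIPT_COMPAT is set. The following is a minimal standalone sketch of those rules as the text states them. It is not the code in pcre_compile.c: the names js_decode_escape, has_xdigits and hex_value are hypothetical, and it assumes an ASCII-compatible character set and a NUL-terminated pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: numeric value of one hexadecimal digit (ASCII only). */
static int hex_value(int ch)
{
  if (ch >= '0' && ch <= '9') return ch - '0';
  return tolower(ch) - 'a' + 10;
}

/* Hypothetical helper: returns 1 if the first n characters of s are all
   hexadecimal digits. It stops at the first non-digit, so it never reads
   past a terminating NUL. */
static int has_xdigits(const char *s, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (!isxdigit((unsigned char)s[i])) return 0;
  return 1;
}

/* Hypothetical sketch of the JavaScript-mode rules in the documentation:
     \uhhhh -> code point hhhh; otherwise \u is a literal "u"
     \xhh   -> code point hh;   otherwise \x is a literal "x"
     \U     -> always a literal "U"
   p must point at the backslash. On success the character value is stored
   in *value and the number of bytes consumed is returned. Returns 0 for
   any escape other than \x, \u or \U. */
size_t js_decode_escape(const char *p, unsigned int *value)
{
  size_t ndigits, i;

  assert(p != NULL && value != NULL && p[0] == '\\');

  switch (p[1])
    {
    case 'u': ndigits = 4; break;
    case 'x': ndigits = 2; break;
    case 'U': ndigits = 0; break;
    default:  return 0;            /* Not handled by this sketch */
    }

  if (ndigits > 0 && has_xdigits(p + 2, ndigits))
    {
    *value = 0;
    for (i = 0; i < ndigits; i++)
      *value = *value * 16 + (unsigned int)hex_value((unsigned char)p[2 + i]);
    return 2 + ndigits;
    }

  /* Too few hex digits, or \U: the letter stands for itself. */
  *value = (unsigned char)p[1];
  return 2;
}
```

Under these assumptions, \u00dc and \xdc both yield the value 0xdc, while \xz yields a literal "x" and leaves the "z" to be read as the next pattern character. This contrasts with the default mode described in pcrepattern.3, where \xz is a binary zero followed by "z".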
diff --git a/ChangeLog b/ChangeLog
index 97afcc3..b843407 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -17,7 +17,7 @@ Version 8.21
parentheses, for example, (?>a(*:m)), were not being passed out. This bug
was introduced by change 18 for 8.20.
-5. Supporting of \x and \u in JavaScript compatibility mode based on the
+5. Supporting of \x, \U and \u in JavaScript compatibility mode based on the
ECMA-262 standard.
diff --git a/configure.ac b/configure.ac
index b940e1e..ddee8e8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -9,9 +9,9 @@ dnl The PCRE_PRERELEASE feature is for identifying release candidates. It might
dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8])
-m4_define(pcre_minor, [20])
-m4_define(pcre_prerelease, [])
-m4_define(pcre_date, [2011-10-21])
+m4_define(pcre_minor, [21])
+m4_define(pcre_prerelease, [-RC1])
+m4_define(pcre_date, [2011-11-14])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 40a5954..da05d05 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -643,6 +643,20 @@ character). Thus, the pattern AB]CD becomes illegal when this option is set.
string (by default this causes the current matching alternative to fail). A
pattern such as (\e1)(a) succeeds when this option is set (assuming it can find
an "a" in the subject), whereas it fails by default, for Perl compatibility.
+.P
+(3) \eU matches an upper case "U" character; by default \eU causes a compile
+time error (Perl uses \eU to upper case subsequent characters).
+.P
+(4) \eu matches a lower case "u" character unless it is followed by four
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, \eu causes a compile time error (Perl uses it to upper
+case the following character).
+.P
+(5) \ex matches a lower case "x" character unless it is followed by two
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, as in Perl, a hexadecimal number is always expected after
+\ex, but it may have zero, one, or two digits (so, for example, \exz matches a
+binary zero character followed by z).
.sp
PCRE_MULTILINE
.sp
@@ -2530,6 +2544,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 23 September 2011
+Last updated: 14 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index 94e72b4..69bcf90 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -38,7 +38,8 @@ represent a binary zero.
own, matching a non-newline character, is supported.) In fact these are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE, an error is
-generated.
+generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
+\eU and \eu are interpreted as JavaScript interprets them.
.P
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE is
built with Unicode character property support. The properties that can be
@@ -174,6 +175,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 09 October 2011
+Last updated: 14 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 15c3351..e35bf71 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -241,7 +241,8 @@ one of the following escape sequences than the binary character it represents:
\et tab (hex 09)
\eddd character with octal code ddd, or back reference
\exhh character with hex code hh
- \ex{hhh..} character with hex code hhh..
+ \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
+ \euhhhh character with hex code hhhh (JavaScript mode only)
.sp
The precise effect of \ecx is as follows: if x is a lower case letter, it
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
@@ -252,21 +253,28 @@ both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte
values are valid. A lower case letter is converted to upper case, and then the
0xc0 bits are flipped.)
.P
-After \ex, from zero to two hexadecimal digits are read (letters can be in
-upper or lower case). Any number of hexadecimal digits may appear between \ex{
-and }, but the value of the character code must be less than 256 in non-UTF-8
-mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
-hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
-point, which is 10FFFF.
+By default, after \ex, from zero to two hexadecimal digits are read (letters
+can be in upper or lower case). Any number of hexadecimal digits may appear
+between \ex{ and }, but the value of the character code must be less than 256
+in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
+value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
+Unicode code point, which is 10FFFF.
.P
If characters other than hexadecimal digits appear between \ex{ and }, or if
there is no terminating }, this form of escape is not recognized. Instead, the
initial \ex will be interpreted as a basic hexadecimal escape, with no
following digits, giving a character whose value is zero.
.P
+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
+as just described only when it is followed by two hexadecimal digits.
+Otherwise, it matches a literal "x" character. In JavaScript mode, support for
+code points greater than 256 is provided by \eu, which must be followed by
+four hexadecimal digits; otherwise it matches a literal "u" character.
+.P
Characters whose value is less than 256 can be defined by either of the two
-syntaxes for \ex. There is no difference in the way they are handled. For
-example, \exdc is exactly the same as \ex{dc}.
+syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
+way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
+\eu00dc in JavaScript mode).
.P
After \e0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \e0\ex\e07
@@ -328,6 +336,16 @@ unrecognized escape sequences, they are treated as the literal characters "B",
set. Outside a character class, these sequences have different meanings.
.
.
+.SS "Unsupported escape sequences"
+.rs
+.sp
+In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
+handler and used to modify the case of following characters. By default, PCRE
+does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
+option is set, \eU matches a "U" character, and \eu can be used to define a
+character by code point, as described in the previous section.
+.
+.
.SS "Absolute and relative back references"
.rs
.sp
@@ -387,7 +405,8 @@ This is the same as
.\" </a>
the "." metacharacter
.\"
-when PCRE_DOTALL is not set.
+when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
+PCRE does not support this.
.P
Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
@@ -964,7 +983,8 @@ special meaning in a character class.
.P
The escape sequence \eN behaves like a dot, except that it is not affected by
the PCRE_DOTALL option. In other words, it matches any character except one
-that signifies the end of a line.
+that signifies the end of a line. Perl also uses \eN to match characters by
+name; PCRE does not support this.
.
.
.SH "MATCHING A SINGLE BYTE"
@@ -2854,6 +2874,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 19 October 2011
+Last updated: 14 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/pcre_compile.c b/pcre_compile.c
index 2687c0b..7ece490 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -687,7 +687,6 @@ else
if ((digitab[ptr[1]] & ctype_xdigit) != 0 && (digitab[ptr[2]] & ctype_xdigit) != 0
&& (digitab[ptr[3]] & ctype_xdigit) != 0 && (digitab[ptr[4]] & ctype_xdigit) != 0)
{
- int i;
c = 0;
for (i = 0; i < 4; ++i)
{
@@ -864,7 +863,6 @@ else
Otherwise it is a lowercase x letter. */
if ((digitab[ptr[1]] & ctype_xdigit) != 0 && (digitab[ptr[2]] & ctype_xdigit) != 0)
{
- int i;
c = 0;
for (i = 0; i < 2; ++i)
{