Merge branch 'merge-pcre' into 10.1

author: Sergei Golubchik <serg@mariadb.org> 2019-04-26 16:11:55 +0200
committer: Sergei Golubchik <serg@mariadb.org> 2019-04-26 16:11:55 +0200
commit: 52eb4f172cc85bf2c9dcc515f747d9bd23089c65 (patch)
tree: 38b27016b279cc5982f99772bff5df1e981cfb93 /pcre
parent: 1389c94b3c33793a3c7f56b5e238e29e7c71d998 (diff)
parent: 879f7e85aa08dda613ea2f481e53392da4864741 (diff)
download: mariadb-git-52eb4f172cc85bf2c9dcc515f747d9bd23089c65.tar.gz
16 files changed, 227 insertions, 27 deletions
diff --git a/pcre/AUTHORS b/pcre/AUTHORS
index eb9b1a44b34..23c005a33d6 100644
--- a/pcre/AUTHORS
+++ b/pcre/AUTHORS
@@ -8,7 +8,7 @@ Email domain:     cam.ac.uk
 University of Cambridge Computing Service,
 Cambridge, England.
 
-Copyright (c) 1997-2018 University of Cambridge
+Copyright (c) 1997-2019 University of Cambridge
 All rights reserved
 
 
@@ -19,7 +19,7 @@ Written by:       Zoltan Herczeg
 Email local part: hzmester
 Emain domain:     freemail.hu
 
-Copyright(c) 2010-2018 Zoltan Herczeg
+Copyright(c) 2010-2019 Zoltan Herczeg
 All rights reserved.
 
 
@@ -30,7 +30,7 @@ Written by:       Zoltan Herczeg
 Email local part: hzmester
 Emain domain:     freemail.hu
 
-Copyright(c) 2009-2018 Zoltan Herczeg
+Copyright(c) 2009-2019 Zoltan Herczeg
 All rights reserved.
 
 
diff --git a/pcre/ChangeLog b/pcre/ChangeLog
index 7b53195f6a6..e4d2d9fa24c 100644
--- a/pcre/ChangeLog
+++ b/pcre/ChangeLog
@@ -5,6 +5,49 @@ Note that the PCRE 8.xx series (PCRE1) is now in a bugfix-only state. All
 development is happening in the PCRE2 10.xx series.
 
 
+Version 8.43 23-February-2019
+-----------------------------
+
+1. Some time ago the config macro SUPPORT_UTF8 was changed to SUPPORT_UTF
+because it also applies to UTF-16 and UTF-32. However, this change was not made
+in the pcre2cpp files; consequently the C++ wrapper has from then been compiled
+with a bug in it, which would have been picked up by the unit test except that
+it also had its UTF8 code cut out. The bug was in a global replace when moving
+forward after matching an empty string.
+
+2. The C++ wrapper got broken a long time ago (version 7.3, August 2007) when
+(*CR) was invented (assuming it was the first such start-of-pattern option).
+The wrapper could never handle such patterns because it wraps patterns in
+(?:...)\z in order to support end anchoring. I have hacked in some code to fix
+this, that is, move the wrapping till after any existing start-of-pattern
+special settings.
+
+3. "pcre2grep" (sic) was accidentally mentioned in an error message (fix was
+ported from PCRE2).
+
+4. Typo LCC_ALL for LC_ALL fixed in pcregrep.
+
+5. In a pattern such as /[^\x{100}-\x{ffff}]*[\x80-\xff]/ which has a repeated
+negative class with no characters less than 0x100 followed by a positive class
+with only characters less than 0x100, the first class was incorrectly being
+auto-possessified, causing incorrect match failures.
+
+6. If the only branch in a conditional subpattern was anchored, the whole
+subpattern was treated as anchored, when it should not have been, since the
+assumed empty second branch cannot be anchored. Demonstrated by test patterns
+such as /(?(1)^())b/ or /(?(?=^))b/.
+
+7. Fix subject buffer overread in JIT when UTF is disabled and \X or \R has
+a greater than 1 fixed quantifier. This issue was found by Yunho Kim.
+
+8. If a pattern started with a subroutine call that had a quantifier with a
+minimum of zero, an incorrect "match must start with this character" could be
+recorded. Example: /(?&xxx)*ABC(?<xxx>XYZ)/ would (incorrectly) expect 'A' to
+be the first character of a match.
+
+9. Improve MAP_JIT flag usage on MacOS. Patch by Rich Siegel.
+
+
 Version 8.42 20-March-2018
 --------------------------
 
diff --git a/pcre/LICENCE b/pcre/LICENCE
index f6ef7fd7664..760a6666b60 100644
--- a/pcre/LICENCE
+++ b/pcre/LICENCE
@@ -25,7 +25,7 @@ Email domain:     cam.ac.uk
 University of Cambridge Computing Service,
 Cambridge, England.
 
-Copyright (c) 1997-2018 University of Cambridge
+Copyright (c) 1997-2019 University of Cambridge
 All rights reserved.
 
 
@@ -34,9 +34,9 @@ PCRE JUST-IN-TIME COMPILATION SUPPORT
 
 Written by:       Zoltan Herczeg
 Email local part: hzmester
-Emain domain:     freemail.hu
+Email domain:     freemail.hu
 
-Copyright(c) 2010-2018 Zoltan Herczeg
+Copyright(c) 2010-2019 Zoltan Herczeg
 All rights reserved.
 
 
@@ -45,9 +45,9 @@ STACK-LESS JUST-IN-TIME COMPILER
 
 Written by:       Zoltan Herczeg
 Email local part: hzmester
-Emain domain:     freemail.hu
+Email domain:     freemail.hu
 
-Copyright(c) 2009-2018 Zoltan Herczeg
+Copyright(c) 2009-2019 Zoltan Herczeg
 All rights reserved.
 
 
diff --git a/pcre/NEWS b/pcre/NEWS
index 09b4ad36003..0f184081740 100644
--- a/pcre/NEWS
+++ b/pcre/NEWS
@@ -1,6 +1,16 @@
 News about PCRE releases
 ------------------------
 
+Note that this library (now called PCRE1) is now being maintained for bug fixes
+only. New projects are advised to use the new PCRE2 libraries.
+
+
+Release 8.43 23-February-2019
+-----------------------------
+
+This is a bug-fix release.
+
+
 Release 8.42 20-March-2018
 --------------------------
 
diff --git a/pcre/configure.ac b/pcre/configure.ac
index dcdef6a9427..d2e5236cbd6 100644
--- a/pcre/configure.ac
+++ b/pcre/configure.ac
@@ -9,17 +9,17 @@ dnl The PCRE_PRERELEASE feature is for identifying release candidates. It might
 dnl be defined as -RC2, for example. For real releases, it should be empty.
 
 m4_define(pcre_major, [8])
-m4_define(pcre_minor, [42])
+m4_define(pcre_minor, [43])
 m4_define(pcre_prerelease, [])
-m4_define(pcre_date, [2018-03-20])
+m4_define(pcre_date, [2019-02-23])
 
 # NOTE: The CMakeLists.txt file searches for the above variables in the first
 # 50 lines of this file. Please update that if the variables above are moved.
 
 # Libtool shared library interface versions (current:revision:age)
-m4_define(libpcre_version, [3:10:2])
-m4_define(libpcre16_version, [2:10:2])
-m4_define(libpcre32_version, [0:10:0])
+m4_define(libpcre_version, [3:11:2])
+m4_define(libpcre16_version, [2:11:2])
+m4_define(libpcre32_version, [0:11:0])
 m4_define(libpcreposix_version, [0:6:0])
 m4_define(libpcrecpp_version, [0:1:0])
 
diff --git a/pcre/pcre_compile.c b/pcre/pcre_compile.c
index 9b9da46f0d0..734875de2fb 100644
--- a/pcre/pcre_compile.c
+++ b/pcre/pcre_compile.c
@@ -6,7 +6,7 @@
 and semantics are as close as possible to those of the Perl 5 language.
 
                        Written by Philip Hazel
-           Copyright (c) 1997-2016 University of Cambridge
+           Copyright (c) 1997-2018 University of Cambridge
 
 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -3300,7 +3300,7 @@ for(;;)
       if ((*xclass_flags & XCL_MAP) == 0)
         {
         /* No bits are set for characters < 256. */
-        if (list[1] == 0) return TRUE;
+        if (list[1] == 0) return (*xclass_flags & XCL_NOT) == 0;
         /* Might be an empty repeat. */
         continue;
         }
@@ -7645,6 +7645,8 @@ for (;; ptr++)
         /* Can't determine a first byte now */
 
         if (firstcharflags == REQ_UNSET) firstcharflags = REQ_NONE;
+        zerofirstchar = firstchar;
+        zerofirstcharflags = firstcharflags;
         continue;
 
 
@@ -8685,10 +8687,18 @@ do {
      if (!is_anchored(scode, new_map, cd, atomcount)) return FALSE;
      }
 
-   /* Positive forward assertions and conditions */
+   /* Positive forward assertion */
 
-   else if (op == OP_ASSERT || op == OP_COND)
+   else if (op == OP_ASSERT)
+     {
+     if (!is_anchored(scode, bracket_map, cd, atomcount)) return FALSE;
+     }
+
+   /* Condition; not anchored if no second branch */
+
+   else if (op == OP_COND)
      {
+     if (scode[GET(scode,1)] != OP_ALT) return FALSE;
      if (!is_anchored(scode, bracket_map, cd, atomcount)) return FALSE;
      }
 
diff --git a/pcre/pcre_jit_compile.c b/pcre/pcre_jit_compile.c
index 2bad74b0231..bc5f9c01433 100644
--- a/pcre/pcre_jit_compile.c
+++ b/pcre/pcre_jit_compile.c
@@ -9002,7 +9002,7 @@ if (exact > 1)
 #ifdef SUPPORT_UTF
       && !common->utf
 #endif
-      )
+      && type != OP_ANYNL && type != OP_EXTUNI)
     {
     OP2(SLJIT_ADD, TMP1, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(exact));
     add_jump(compiler, &backtrack->topbacktracks, CMP(SLJIT_GREATER, TMP1, 0, STR_END, 0));
diff --git a/pcre/pcrecpp.cc b/pcre/pcrecpp.cc
index d09c9abc516..77a2fedc4be 100644
--- a/pcre/pcrecpp.cc
+++ b/pcre/pcrecpp.cc
@@ -80,6 +80,24 @@ static const string empty_string;
 // If the user doesn't ask for any options, we just use this one
 static RE_Options default_options;
 
+// Specials for the start of patterns. See comments where start_options is used
+// below. (PH June 2018)
+static const char *start_options[] = {
+  "(*UTF8)",
+  "(*UTF)",
+  "(*UCP)",
+  "(*NO_START_OPT)",
+  "(*NO_AUTO_POSSESS)",
+  "(*LIMIT_RECURSION=",
+  "(*LIMIT_MATCH=",
+  "(*CRLF)",
+  "(*CR)",
+  "(*BSR_UNICODE)",
+  "(*BSR_ANYCRLF)",
+  "(*ANYCRLF)",
+  "(*ANY)",
+  "" };
+
 void RE::Init(const string& pat, const RE_Options* options) {
   pattern_ = pat;
   if (options == NULL) {
@@ -135,7 +153,49 @@ pcre* RE::Compile(Anchor anchor) {
   } else {
     // Tack a '\z' at the end of RE.  Parenthesize it first so that
     // the '\z' applies to all top-level alternatives in the regexp.
-    string wrapped = "(?:";  // A non-counting grouping operator
+
+    /* When this code was written (for PCRE 6.0) it was enough just to
+    parenthesize the entire pattern. Unfortunately, when the feature of
+    starting patterns with (*UTF8) or (*CR) etc. was added to PCRE patterns,
+    this code was never updated. This bug was not noticed till 2018, long after
+    PCRE became obsolescent and its maintainer no longer around. Since PCRE is
+    frozen, I have added a hack to check for all the existing "start of
+    pattern" specials - knowing that no new ones will ever be added. I am not a
+    C++ programmer, so the code style is no doubt crude. It is also
+    inefficient, but is only run when the pattern starts with "(*".
+    PH June 2018. */
+
+    string wrapped = "";
+
+    if (pattern_.c_str()[0] == '(' && pattern_.c_str()[1] == '*') {
+      int kk, klen, kmat;
+      for (;;) {   // Loop for any number of leading items
+
+        for (kk = 0; start_options[kk][0] != 0; kk++) {
+          klen = strlen(start_options[kk]);
+          kmat = strncmp(pattern_.c_str(), start_options[kk], klen);
+          if (kmat >= 0) break;
+        }
+        if (kmat != 0) break;  // Not found
+
+        // If the item ended in "=" we must copy digits up to ")".
+
+        if (start_options[kk][klen-1] == '=') {
+          while (isdigit(pattern_.c_str()[klen])) klen++;
+          if (pattern_.c_str()[klen] != ')') break;  // Syntax error
+          klen++;
+        }
+
+        // Move the item from the pattern to the start of the wrapped string.
+
+        wrapped += pattern_.substr(0, klen);
+        pattern_.erase(0, klen);
+      }
+    }
+
+    // Wrap the rest of the pattern.
+
+    wrapped += "(?:";  // A non-counting grouping operator
     wrapped += pattern_;
     wrapped += ")\\z";
     re = pcre_compile(wrapped.c_str(), pcre_options,
@@ -415,7 +475,7 @@ int RE::GlobalReplace(const StringPiece& rewrite,
           matchend++;
         }
         // We also need to advance more than one char if we're in utf8 mode.
-#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF
         if (options_.utf8()) {
           while (matchend < static_cast<int>(str->length()) &&
                  ((*str)[matchend] & 0xc0) == 0x80)
diff --git a/pcre/pcrecpp_unittest.cc b/pcre/pcrecpp_unittest.cc
index 4b15fbef1c3..1fc01a042b3 100644
--- a/pcre/pcrecpp_unittest.cc
+++ b/pcre/pcrecpp_unittest.cc
@@ -309,7 +309,7 @@ static void TestReplace() {
       "@aa",
       "@@@",
       3 },
-#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF
     { "b*",
       "bb",
       "\xE3\x83\x9B\xE3\x83\xBC\xE3\x83\xA0\xE3\x81\xB8",   // utf8
@@ -327,7 +327,7 @@ static void TestReplace() {
     { "", NULL, NULL, NULL, NULL, 0 }
   };
 
-#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF
   const bool support_utf8 = true;
 #else
   const bool support_utf8 = false;
@@ -535,7 +535,7 @@ static void TestQuoteMetaLatin1() {
 }
 
 static void TestQuoteMetaUtf8() {
-#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF
   TestQuoteMeta("Pl\xc3\xa1\x63ido Domingo", pcrecpp::UTF8());
   TestQuoteMeta("xyz", pcrecpp::UTF8());            // No fancy utf8
   TestQuoteMeta("\xc2\xb0", pcrecpp::UTF8());       // 2-byte utf8 (degree symbol)
@@ -1178,7 +1178,7 @@ int main(int argc, char** argv) {
     CHECK(re.error().empty());  // Must have no error
   }
 
-#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF
   // Check UTF-8 handling
   {
     printf("Testing UTF-8 handling\n");
@@ -1203,6 +1203,30 @@ int main(int argc, char** argv) {
     RE re_test2("...", pcrecpp::UTF8());
     CHECK(re_test2.FullMatch(utf8_string));
 
+    // PH added these tests for leading option settings
+
+    RE re_testZ0("(*CR)(*NO_START_OPT).........");
+    CHECK(re_testZ0.FullMatch(utf8_string));
+
+#ifdef SUPPORT_UTF
+    RE re_testZ1("(*UTF8)...");
+    CHECK(re_testZ1.FullMatch(utf8_string));
+
+    RE re_testZ2("(*UTF)...");
+    CHECK(re_testZ2.FullMatch(utf8_string));
+
+#ifdef SUPPORT_UCP
+    RE re_testZ3("(*UCP)(*UTF)...");
+    CHECK(re_testZ3.FullMatch(utf8_string));
+
+    RE re_testZ4("(*UCP)(*LIMIT_MATCH=1000)(*UTF)...");
+    CHECK(re_testZ4.FullMatch(utf8_string));
+
+    RE re_testZ5("(*UCP)(*LIMIT_MATCH=1000)(*ANY)(*UTF)...");
+    CHECK(re_testZ5.FullMatch(utf8_string));
+#endif
+#endif
+
     // Check that '.' matches one byte or UTF-8 character
     // according to the mode.
     string ss;
@@ -1248,7 +1272,7 @@ int main(int argc, char** argv) {
     CHECK(!match_sentence.FullMatch(target));
     CHECK(!match_sentence_re.FullMatch(target));
   }
-#endif  /* def SUPPORT_UTF8 */
+#endif  /* def SUPPORT_UTF */
 
   printf("Testing error reporting\n");
 
diff --git a/pcre/pcregrep.c b/pcre/pcregrep.c
index a406be962d7..5982406862b 100644
--- a/pcre/pcregrep.c
+++ b/pcre/pcregrep.c
@@ -2252,7 +2252,7 @@ if (isdirectory(pathname))
       int fnlength = strlen(pathname) + strlen(nextfile) + 2;
       if (fnlength > 2048)
         {
-        fprintf(stderr, "pcre2grep: recursive filename is too long\n");
+        fprintf(stderr, "pcregrep: recursive filename is too long\n");
         rc = 2;
         break;
         }
@@ -3034,7 +3034,7 @@ LC_ALL environment variable is set, and if so, use it. */
 if (locale == NULL)
   {
   locale = getenv("LC_ALL");
-  locale_from = "LCC_ALL";
+  locale_from = "LC_ALL";
   }
 
 if (locale == NULL)
diff --git a/pcre/testdata/testinput1 b/pcre/testdata/testinput1
index 5c23f41fa81..02e4f4825fc 100644
--- a/pcre/testdata/testinput1
+++ b/pcre/testdata/testinput1
@@ -5742,4 +5742,19 @@ AbcdCBefgBhiBqz
 /X+(?#comment)?/
     >XXX<
 
+/   (?<word> \w+ )*    \.   /xi
+    pokus.
+    
+/(?(DEFINE) (?<word> \w+ ) ) (?&word)*   \./xi
+    pokus.
+
+/(?(DEFINE) (?<word> \w+ ) ) ( (?&word)* )   \./xi 
+    pokus.
+
+/(?&word)*  (?(DEFINE) (?<word> \w+ ) )  \./xi
+    pokus.
+
+/(?&word)*  \. (?<word> \w+ )/xi
+    pokus.hokus
+
 /-- End of testinput1 --/
diff --git a/pcre/testdata/testinput2 b/pcre/testdata/testinput2
index 8ba4dc4ddab..3528de153eb 100644
--- a/pcre/testdata/testinput2
+++ b/pcre/testdata/testinput2
@@ -4257,4 +4257,7 @@ backtracking verbs. --/
     ab
     aaab 
 
+/(?(?=^))b/
+    abc
+
 /-- End of testinput2 --/
diff --git a/pcre/testdata/testinput4 b/pcre/testdata/testinput4
index 8bdbdac4c26..63368c0a097 100644
--- a/pcre/testdata/testinput4
+++ b/pcre/testdata/testinput4
@@ -727,4 +727,7 @@
 /\C(\W?ſ)'?{{/8
     \\C(\\W?ſ)'?{{
 
+/[^\x{100}-\x{ffff}]*[\x80-\xff]/8
+    \x{99}\x{99}\x{99}
+
 /-- End of testinput4 --/
diff --git a/pcre/testdata/testoutput1 b/pcre/testdata/testoutput1
index eff8ecc948c..e6147e60b95 100644
--- a/pcre/testdata/testoutput1
+++ b/pcre/testdata/testoutput1
@@ -9446,4 +9446,28 @@ No match
     >XXX<
  0: X
 
+/   (?<word> \w+ )*    \.   /xi
+    pokus.
+ 0: pokus.
+ 1: pokus
+    
+/(?(DEFINE) (?<word> \w+ ) ) (?&word)*   \./xi
+    pokus.
+ 0: pokus.
+
+/(?(DEFINE) (?<word> \w+ ) ) ( (?&word)* )   \./xi 
+    pokus.
+ 0: pokus.
+ 1: <unset>
+ 2: pokus
+
+/(?&word)*  (?(DEFINE) (?<word> \w+ ) )  \./xi
+    pokus.
+ 0: pokus.
+
+/(?&word)*  \. (?<word> \w+ )/xi
+    pokus.hokus
+ 0: pokus.hokus
+ 1: hokus
+
 /-- End of testinput1 --/
diff --git a/pcre/testdata/testoutput2 b/pcre/testdata/testoutput2
index 61ed8d9d4e4..4ccda272010 100644
--- a/pcre/testdata/testoutput2
+++ b/pcre/testdata/testoutput2
@@ -14721,4 +14721,8 @@ No need char
  0: ab
  1: a
 
+/(?(?=^))b/
+    abc
+ 0: b
+
 /-- End of testinput2 --/
diff --git a/pcre/testdata/testoutput4 b/pcre/testdata/testoutput4
index d43c12392dd..69e812cd357 100644
--- a/pcre/testdata/testoutput4
+++ b/pcre/testdata/testoutput4
@@ -1277,4 +1277,8 @@ No match
     \\C(\\W?ſ)'?{{
 No match
 
+/[^\x{100}-\x{ffff}]*[\x80-\xff]/8
+    \x{99}\x{99}\x{99}
+ 0: \x{99}\x{99}\x{99}
+
 /-- End of testinput4 --/
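
Note on the pcrecpp.cc change: the end-anchoring wrapper now moves any leading
"(*...)" items in front of the "(?:" group, because PCRE only recognises them
at the very start of a pattern. The sketch below restates that idea in
standalone C++ so the behaviour can be checked without building PCRE. It is not
the code from the commit: the function name WrapForEndAnchor and the table name
kStartOptions are hypothetical, and only the list of option strings is taken
from the diff above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Leading "(*...)" items, copied from the start_options table in the diff.
static const char* const kStartOptions[] = {
  "(*UTF8)", "(*UTF)", "(*UCP)", "(*NO_START_OPT)", "(*NO_AUTO_POSSESS)",
  "(*LIMIT_RECURSION=", "(*LIMIT_MATCH=", "(*CRLF)", "(*CR)",
  "(*BSR_UNICODE)", "(*BSR_ANYCRLF)", "(*ANYCRLF)", "(*ANY)"
};

// Hypothetical helper (not part of pcrecpp): returns the pattern rewritten as
//   <leading items>(?:<rest of pattern>)\z
std::string WrapForEndAnchor(const std::string& pattern) {
  std::size_t pos = 0;        // end of the leading items recognised so far
  bool progressed = true;
  while (progressed) {        // loop for any number of leading items
    progressed = false;
    for (const char* opt : kStartOptions) {
      const std::string item(opt);
      if (pattern.compare(pos, item.size(), item) != 0) continue;
      std::size_t end = pos + item.size();
      if (item.back() == '=') {
        // Items ending in "=" carry a number: skip digits up to ")".
        while (end < pattern.size() &&
               std::isdigit(static_cast<unsigned char>(pattern[end]))) {
          ++end;
        }
        if (end >= pattern.size() || pattern[end] != ')') break;  // malformed
        ++end;
      }
      pos = end;
      progressed = true;
      break;                  // rescan the table from the new position
    }
  }
  assert(pos <= pattern.size());
  return pattern.substr(0, pos) + "(?:" + pattern.substr(pos) + ")\\z";
}
```

For example, WrapForEndAnchor("(*UCP)(*LIMIT_MATCH=1000)(*UTF)...") yields
"(*UCP)(*LIMIT_MATCH=1000)(*UTF)(?:...)\z", while a pattern with an unknown or
malformed leading item is wrapped whole and left for pcre_compile() to reject.
Unlike the committed code, this sketch does not modify the stored pattern and
does not rely on the table being sorted.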