author Nikita Popov <nikita.ppv@gmail.com> 2019-03-18 12:57:43 +0100
committer Nikita Popov <nikita.ppv@gmail.com> 2019-03-18 16:58:48 +0100
commit 2b9acd37f0a13572684dde80e3e56d5c1b2ec045
tree ca4e1541d2e998a2d0d4f9ed5ac5c71418edd45e /ext/pcre/tests
parent 8c9d8c3f667e4cedc7499b49dcc52644dac17c53
Fixed bug #72685

We currently have a large performance problem when implementing lexers that work on UTF-8 strings in PHP. This kind of code tends to perform a large number of matches at different offsets on a single string. This is generally fast. However, if /u mode is used, the full string is UTF-8 validated on each match, which results in quadratic runtime.

This patch fixes the issue by adding an IS_STR_VALID_UTF8 flag, which is set once we have determined that the string is valid UTF-8, so that further validation is skipped.

A limitation of this approach is that we can't set the flag for interned strings. I think this is not a problem for this use case, which will generally work on dynamic data. If we want to use this flag for other purposes as well (mbstring?), then it might be worthwhile to UTF-8 validate strings during interning. But right now this doesn't seem useful.

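
For illustration, the access pattern described in the commit message might look like the following minimal lexer sketch. The token names, the TOKEN_PATTERNS table, and the tokenize() function are hypothetical and not part of this patch; only preg_match() with the \G anchor, the /u modifier, and the offset argument are taken from the test below.

```php
<?php
// Hypothetical token table (not part of the patch): name => anchored pattern.
// \G pins each match to the requested offset; /u treats the subject as UTF-8.
const TOKEN_PATTERNS = [
    'T_WHITESPACE' => '/\G\s+/u',
    'T_NUMBER'     => '/\G\d+/u',
    'T_WORD'       => '/\G\w+/u',
    'T_OTHER'      => '/\G./su', // fallback: any single character
];

/**
 * Splits $input into [name, text] pairs by repeatedly matching
 * at an advancing byte offset on the same subject string.
 */
function tokenize(string $input): array
{
    $tokens = [];
    $pos = 0;
    $length = strlen($input); // byte length; preg_match() offsets are in bytes

    while ($pos < $length) {
        foreach (TOKEN_PATTERNS as $name => $pattern) {
            // Same subject string on every call, only the offset changes.
            if (preg_match($pattern, $input, $m, 0, $pos) === 1) {
                $tokens[] = [$name, $m[0]];
                $pos += strlen($m[0]);
                continue 2; // restart pattern list at the new offset
            }
        }
        // Reached when no pattern matches, e.g. for invalid UTF-8 input.
        throw new RuntimeException("Cannot tokenize input at byte offset $pos");
    }

    return $tokens;
}
```

Every iteration of the loop calls preg_match() on the same $input with a different offset. According to the commit message, each of those /u calls previously revalidated the whole subject, which is where the quadratic runtime came from.
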
Diffstat (limited to 'ext/pcre/tests')
-rw-r--r-- ext/pcre/tests/bug72685.phpt 17
1 files changed, 17 insertions, 0 deletions
diff --git a/ext/pcre/tests/bug72685.phpt b/ext/pcre/tests/bug72685.phpt
new file mode 100644
index 0000000000..7f6eabc182
--- /dev/null
+++ b/ext/pcre/tests/bug72685.phpt
@@ -0,0 +1,17 @@
+--TEST--
+Bug #72685: Same string is UTF-8 validated repeatedly
+--FILE--
+<?php
+
+$input_size = 64 * 1024;
+$str = str_repeat('a', $input_size);
+
+$start = microtime(true);
+$pos = 0;
+while (preg_match('/\G\w/u', $str, $m, 0, $pos)) ++$pos;
+$end = microtime(true);
+var_dump(($end - $start) < 0.5); // large margin, more like 0.05 in debug build
+
+?>
+--EXPECT--
+bool(true)