author | Nikita Popov <nikita.ppv@gmail.com> | 2019-03-18 12:57:43 +0100 |
---|---|---|
committer | Nikita Popov <nikita.ppv@gmail.com> | 2019-03-18 16:58:48 +0100 |
commit | 2b9acd37f0a13572684dde80e3e56d5c1b2ec045 (patch) | |
tree | ca4e1541d2e998a2d0d4f9ed5ac5c71418edd45e /ext/pcre/tests | |
parent | 8c9d8c3f667e4cedc7499b49dcc52644dac17c53 (diff) | |
download | php-git-2b9acd37f0a13572684dde80e3e56d5c1b2ec045.tar.gz |
Fixed bug #72685
We currently have a large performance problem when implementing lexers
working on UTF-8 strings in PHP. This kind of code tends to perform a
large number of matches at different offsets on a single string. This
is generally fast. However, if /u mode is used, the full string will
be UTF-8 validated on each match. This results in quadratic runtime.

This patch fixes the issue by adding an IS_STR_VALID_UTF8 flag, which
is set once we have determined that the string is valid UTF-8, so that
further validation is skipped.

A limitation of this approach is that we can't set the flag for interned
strings. I think this is not a problem for this use case, which will
generally work on dynamic data. If we want to use this flag for other
purposes as well (mbstring?) then it might be worthwhile to UTF-8 validate
strings during interning. But right now this doesn't seem useful.
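
The caching idea described in the message can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `demo_str`, the `DEMO_STR_*` flags, `demo_scan_calls` and both functions are hypothetical names, and the struct is not the real `zend_string` layout. The scanner is also deliberately simplified (it checks only lead and continuation byte structure, not overlong encodings or surrogates).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, standing in for IS_STR_VALID_UTF8 and the
 * interned-string marker. The real values and layout are not shown here. */
#define DEMO_STR_VALID_UTF8 (1u << 0)
#define DEMO_STR_INTERNED   (1u << 1)

/* Hypothetical string header carrying per-string flags. */
typedef struct {
    uint32_t flags;
    size_t len;
    const unsigned char *val;
} demo_str;

/* Counts full validation passes so the saving can be observed. */
static size_t demo_scan_calls = 0;

/* Simplified O(len) structural UTF-8 check (assumption: good enough for
 * illustration; a real validator must also reject overlong forms etc.). */
static bool demo_utf8_scan(const unsigned char *s, size_t len) {
    demo_scan_calls++;
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;                    /* invalid lead byte */
        if (extra > len - i - 1) return false; /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}

/* Validate once, then remember the result in the string's flags.
 * Interned strings never get the flag, mirroring the stated limitation. */
static bool demo_check_utf8(demo_str *s) {
    assert(s != NULL);
    if (s->flags & DEMO_STR_VALID_UTF8) {
        return true; /* already validated earlier: skip the scan */
    }
    if (!demo_utf8_scan(s->val, s->len)) {
        return false;
    }
    if (!(s->flags & DEMO_STR_INTERNED)) {
        s->flags |= DEMO_STR_VALID_UTF8;
    }
    return true;
}
```

With a check like this, n matches against one non-interned string of length n cost a single O(n) scan instead of n scans, which is where the quadratic behaviour goes away. For a string marked interned, the sketch rescans on every call, which corresponds to the limitation noted above.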
Diffstat (limited to 'ext/pcre/tests')
-rw-r--r-- | ext/pcre/tests/bug72685.phpt | 17 |
1 files changed, 17 insertions, 0 deletions
diff --git a/ext/pcre/tests/bug72685.phpt b/ext/pcre/tests/bug72685.phpt
new file mode 100644
index 0000000000..7f6eabc182
--- /dev/null
+++ b/ext/pcre/tests/bug72685.phpt
@@ -0,0 +1,17 @@
+--TEST--
+Bug #72685: Same string is UTF-8 validated repeatedly
+--FILE--
+<?php
+
+$input_size = 64 * 1024;
+$str = str_repeat('a', $input_size);
+
+$start = microtime(true);
+$pos = 0;
+while (preg_match('/\G\w/u', $str, $m, 0, $pos)) ++$pos;
+$end = microtime(true);
+var_dump(($end - $start) < 0.5); // large margin, more like 0.05 in debug build
+
+?>
+--EXPECT--
+bool(true)