author | Karl Williamson <khw@cpan.org> | 2014-09-22 13:59:39 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2014-09-29 13:07:07 -0600 |
commit | b35552de5cea8eb47ccb046284ecb9a099430255 (patch) | |
tree | 9cddbc9de1b38404bcf6fdae9e65f46b5a5d3e79 /regen | |
parent | dea37815c59831b7e586fa51968348fbb8009e1a (diff) | |
Tighten uses of regex synthetic start class
A synthetic start class (SSC) is generated by the regular expression
pattern compiler to consolidate everything that can match at the
position where a match of the pattern can begin.
For example
qr/a?bfoo/;
requires the match to begin with either an 'a' or a 'b'. There are no
other possibilities. We can set things up to quickly scan for either of
these in the target string, and only when one of these is found do we
need to look for 'foo'.
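
The pre-scan described above can be imitated by hand. The following is only a sketch of the idea, not the regex engine's actual code; the name find_with_start_class is hypothetical, and the start class for qr/a?bfoo/ is hard-coded as 'a' or 'b'.

```perl
use strict;
use warnings;

# Hypothetical illustration of an SSC-style pre-scan for qr/a?bfoo/.
# Returns the offset of the first match, or -1 if there is none.
sub find_with_start_class {
    my ($target) = @_;
    for my $pos (0 .. length($target) - 1) {
        my $ch = substr($target, $pos, 1);

        # Cheap test: only 'a' or 'b' can begin a match.
        next unless $ch eq 'a' || $ch eq 'b';

        # Only at a candidate position do we try the whole pattern,
        # anchored there with \G.
        pos($target) = $pos;
        return $pos if $target =~ /\Ga?bfoo/g;
    }
    return -1;
}

print find_with_start_class("xxabfoo"), "\n";
```

Positions whose character is outside the start class are skipped without ever trying the full pattern, which is where the saving comes from.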
There is an overhead associated with using SSCs. If the number of
possibilities that the SSC excludes is relatively small, it can be
counter-productive to use them.
This patch creates a crude sieve to decide whether to use an SSC or not.
If the SSC doesn't exclude at least half the "likely" possibilities, it
is discarded. This patch is a starting point, and can be refined if
necessary as we gain experience.
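
As a rough sketch of the sieve's rule, assuming we already know how many of the "likely" code points the SSC matches: the function name ssc_is_worth_keeping and both parameters are hypothetical, and the real test in regcomp.c may be expressed differently.

```perl
use strict;
use warnings;

# Hypothetical sketch of the sieve: keep the SSC only if it excludes
# at least half of the "likely" code points.
#   $matched      - number of likely code points the SSC can match
#   $likely_total - total number of likely code points
sub ssc_is_worth_keeping {
    my ($matched, $likely_total) = @_;
    my $excluded = $likely_total - $matched;
    return $excluded * 2 >= $likely_total ? 1 : 0;
}

print ssc_is_worth_keeping(2, 100), "\n";
```

An SSC that matches only 'a' and 'b' excludes nearly everything and is kept, while one that matches most of the likely set is discarded.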
See thread beginning with
http://nntp.perl.org/group/perl.perl5.porters/212644
In many patterns, no SSC is generated; and with the advent of tries,
SSCs have become less important, so whatever we do is not terribly
critical.
Diffstat (limited to 'regen')
-rw-r--r-- | regen/unicode_constants.pl | 15 |
1 files changed, 15 insertions, 0 deletions
diff --git a/regen/unicode_constants.pl b/regen/unicode_constants.pl
index c81f7676d2..936c1a8a6c 100644
--- a/regen/unicode_constants.pl
+++ b/regen/unicode_constants.pl
@@ -155,7 +155,22 @@ foreach my $charset (get_supported_code_pages()) {
     printf $out_fh "# define MAX_PRINT_A_FOR_USE_ONLY_BY_REGCOMP_DOT_C 0x%02X /* The max code point that isPRINT_A */\n", $max_PRINT_A;

     print $out_fh "\n" . get_conditional_compile_line_end();
+
+}
+
+use Unicode::UCD 'prop_invlist';
+
+my $count = 0;
+my @other_invlist = prop_invlist("Other");
+for (my $i = 0; $i < @other_invlist; $i += 2) {
+    $count += ((defined $other_invlist[$i+1])
+              ? $other_invlist[$i+1]
+              : 0x110000)
+              - $other_invlist[$i];
 }
+printf $out_fh "\n/* The number of code points not matching \\pC */\n"
+             . "#define NON_OTHER_COUNT_FOR_USE_ONLY_BY_REGCOMP_DOT_C %d\n",
+             0x110000 - $count;

 print $out_fh "\n#endif /* H_UNICODE_CONSTANTS */\n";
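
The loop added by the patch walks an inversion list, in which even-indexed elements begin a range of matching code points and odd-indexed elements begin the following non-matching range. A missing final element means the last range runs to the end of the code space. The same counting can be tried on a small hand-written list. This is a sketch: count_in_invlist and the toy list are hypothetical, and the upper limit is passed in rather than fixed at 0x110000.

```perl
use strict;
use warnings;

# Hypothetical helper: count the code points covered by an inversion
# list, treating an unterminated final range as running up to $limit.
sub count_in_invlist {
    my ($limit, @invlist) = @_;
    my $count = 0;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $end = defined $invlist[$i+1] ? $invlist[$i+1] : $limit;
        $count += $end - $invlist[$i];
    }
    return $count;
}

# Toy list: 'A'..'Z' (0x41 up to but not including 0x5B), then 'a' onwards.
my @toy = (0x41, 0x5B, 0x61);
printf "%d\n", count_in_invlist(0x80, @toy);
```

Subtracting such a count from the size of the code space, as the patch does with 0x110000, gives the number of code points outside the property.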