author | Karl Williamson <public@khwilliamson.com> | 2012-12-17 21:37:40 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2012-12-22 11:11:32 -0700 |
commit | 3018b823898645e44b8c37c70ac5c6302b031381 (patch) | |
tree | 0a26845e850bbc243726255ea67f9100c491d4ef /regcomp.sym | |
parent | 7aee35ffd7ab21d1007b7bacdc860c9b48f32758 (diff) | |
download | perl-3018b823898645e44b8c37c70ac5c6302b031381.tar.gz |
Consolidate some regex OPS

The regular expression operation POSIXA works on any of the (currently)
16 posix classes (like \w and [:graph:]) under the regex modifier /a.
This commit creates similar operations for the other modifiers: POSIXL
(for /l), POSIXD (for /d), POSIXU (for /u), plus their complements.
It causes these ops to be generated instead of the ALNUM, DIGIT,
HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net
saving is 22 regnode types.

The reason to do this is for maintenance. As of this commit, there are
now 22 fewer node types for which code has to be maintained. The code
for each variant was essentially the same logic, but on different
operands. It would be easy to make a change to one copy and forget to
make the corresponding change in the others. Indeed, this patch fixes
[perl #114272] in which one copy was out of sync with others.

This patch actually reduces the number of separate code paths to 5:
POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the
last 3 use the same code path as their non-complemented version, except
that a variable is initialized differently. The code then XORs this
variable with its result to do the complementing or not. Further, the
POSIXD branch now just checks if the target string being matched is
UTF-8 or not, and then jumps to either the POSIXU or POSIXA code
respectively. So, there are effectively only 4 cases that are coded:
POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry
about UTF-8, while NPOSIXA does, hence these for efficiency are coded
separately.)
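
As a rough illustration of the XOR technique described above, here is a minimal C sketch. It is not the actual regexec.c code: every name in it (`demo_class`, `demo_in_class`, `demo_match`, `demo_posixd_rules`) is hypothetical, only two classes are modelled, and the class test is ASCII-only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical class numbers, for illustration only. */
enum demo_class { DEMO_DIGIT, DEMO_SPACE, DEMO_CLASS_COUNT };

/* Hypothetical stand-ins for the two rule sets POSIXD chooses between. */
enum demo_rules { DEMO_RULES_A, DEMO_RULES_U };

/* ASCII-only membership test standing in for the real class lookup. */
static bool demo_in_class(enum demo_class cls, unsigned char ch)
{
    assert(cls < DEMO_CLASS_COUNT);
    switch (cls) {
    case DEMO_DIGIT:
        return ch >= '0' && ch <= '9';
    case DEMO_SPACE:
        return ch == ' ' || ch == '\t' || ch == '\n'
            || ch == '\f' || ch == '\r';
    default:
        return false;
    }
}

/* One code path serves an op and its complement: 'complement' is false
 * for the plain op and true for the N-prefixed op, and it is XORed with
 * the raw membership result. */
static bool demo_match(enum demo_class cls, bool complement, unsigned char ch)
{
    return demo_in_class(cls, ch) ^ complement;
}

/* The /d case only picks which rules apply, based on whether the target
 * string is UTF-8, and then reuses that code path. */
static enum demo_rules demo_posixd_rules(bool target_is_utf8)
{
    return target_is_utf8 ? DEMO_RULES_U : DEMO_RULES_A;
}
```

With this shape, `demo_match(DEMO_DIGIT, false, '7')` and `demo_match(DEMO_DIGIT, true, 'x')` both succeed through the same function, which is the point of initializing one variable differently instead of duplicating the code.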

Removing all this code saves memory. The output of the Linux size
command shows that the perl executable was shrunk by 33K bytes on my
platform compiled under -O0 (0.7%) and by 18K bytes (1.3%) under -O2.

The reason this patch was doable was previous work in numbering the
POSIX classes, so that they could be indexed in arrays and bit
positions. This is a large patch; I didn't see how to break it into
smaller components.
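
To show why numbering the classes helps, here is a small C sketch in which a class number serves both as an array index and as a bit position. The names (`CC_DIGIT`, `cc_table`, `cc_is`, and so on) and the numbering are assumptions made for this example, not the identifiers used in the perl source, and only ASCII is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical numbering of three classes. */
enum { CC_DIGIT, CC_SPACE, CC_WORDCHAR, CC_COUNT };

#define CC_BIT(n) (1u << (n))

/* The class number indexes an array ... */
static const char *const cc_name[CC_COUNT] = { "digit", "space", "word" };

/* ... and selects a bit in a per-character mask (ASCII only here). */
static unsigned cc_table[128];

static void cc_init(void)
{
    for (int c = 0; c < 128; c++) {
        unsigned bits = 0;
        if (c >= '0' && c <= '9')
            bits |= CC_BIT(CC_DIGIT) | CC_BIT(CC_WORDCHAR);
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_')
            bits |= CC_BIT(CC_WORDCHAR);
        if (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r')
            bits |= CC_BIT(CC_SPACE);
        cc_table[c] = bits;
    }
}

/* One generic test replaces a hand-written function per class. */
static bool cc_is(unsigned class_num, unsigned char ch)
{
    assert(class_num < CC_COUNT);
    return ch < 128 && (cc_table[ch] & CC_BIT(class_num)) != 0;
}
```

After `cc_init()`, a single call such as `cc_is(CC_WORDCHAR, '_')` covers what would otherwise need a separate function (and a separate regnode type) per class.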

I chose to make this code more efficient as opposed to saving even more
memory. Thus there is a separate loop that is jumped to after we know
we have to load a swash; this just saves having to test if the swash is
loaded each time through the loop. I avoid loading the swash until
absolutely necessary. In places in the previous version of this code,
the swash was loaded when the input was UTF-8, even if it wasn't yet
needed (and might never be if the input didn't contain anything above
Latin1), apparently to avoid the extra test per iteration.
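
The two-loop structure described above might look roughly like this in C. This is a simplified sketch under stated assumptions, not the code in regexec.c: the input is an array of already-decoded code points rather than UTF-8, the "swash" is replaced by a counter and a hard-coded range check, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the stand-in for the swash gets loaded. */
static int demo_table_loads = 0;

static void demo_load_table(void)
{
    demo_table_loads++;
}

/* Stand-ins for the real lookups; only digits are modelled. */
static bool demo_low_is_digit(unsigned cp)
{
    return cp >= '0' && cp <= '9';
}

static bool demo_high_is_digit(unsigned cp)
{
    return cp >= 0x0660 && cp <= 0x0669;   /* ARABIC-INDIC DIGIT ZERO..NINE */
}

/* Returns how many leading code points are digits. */
static size_t demo_span_digits(const unsigned *cps, size_t len)
{
    size_t i = 0;

    /* First loop: nothing is loaded while the input stays within Latin1. */
    for (; i < len; i++) {
        if (cps[i] > 0xFF)
            goto need_table;
        if (!demo_low_is_digit(cps[i]))
            return i;
    }
    return i;

need_table:
    demo_load_table();   /* loaded only now that it is actually needed */

    /* Second loop: the table is known to be loaded, so there is no
     * "is it loaded yet?" test on each iteration. */
    for (; i < len; i++) {
        bool ok = cps[i] > 0xFF ? demo_high_is_digit(cps[i])
                                : demo_low_is_digit(cps[i]);
        if (!ok)
            return i;
    }
    return i;
}
```

An input that never goes above Latin1 finishes in the first loop and leaves `demo_table_loads` at zero. Unlike a real cached table, this stand-in is "loaded" once per call that needs it.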

The Perl test suite runs slightly faster on my platform with this patch
under -O0, and the speeds are indistinguishable under -O2. This is in
spite of these new POSIX regops being unknown to the regex optimizer
(this will be addressed in future commits), and extra machine
instructions being required for each character (the xor, and some
shifting and masking). I expect this is a result of better caching, and
not loading swashes unless absolutely necessary.

Diffstat (limited to 'regcomp.sym')
-rw-r--r-- | regcomp.sym | 54 |
1 files changed, 9 insertions, 45 deletions
diff --git a/regcomp.sym b/regcomp.sym
index eb8ba46238..2a49d20379 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -36,8 +36,7 @@ SEOL EOL, no ; Same, assuming singleline.
 # modifiers have to ordered thusly: /d, /l, /u, /a, /aa. This is because code
 # in regcomp.c uses the enum value of the modifier as an offset from the /d
 # version. The complements must come after the non-complements.
-# BOUND, ALNUM, SPACE, DIGIT, and their complements are affected, as well as
-# EXACTF.
+# BOUND, POSIX and their complements are affected, as well as EXACTF.
 BOUND BOUND, no ; Match "" at any word boundary using native charset semantics for non-utf8
 BOUNDL BOUND, no ; Match "" at any locale word boundary
 BOUNDU BOUND, no ; Match "" at any word boundary using Unicode semantics
@@ -56,44 +55,16 @@ SANY REG_ANY, no 0 S ; Match any one character.
 CANY REG_ANY, no 0 S ; Match any one byte.
 ANYOF ANYOF, sv 0 S ; Match character in (or not in) this class, single char match only
 
-# Order (within each group) of the below is important. See ordering comment
-# above. The PLACEHOLDERn ones are wasting a value. Right now, we have plenty
-# to spare, but these would be obvious candidates if ever we ran out of node
-# types in a U8.
-ALNUM ALNUM, no 0 S ; Match any alphanumeric character using native charset semantics for non-utf8
-ALNUML ALNUM, no 0 S ; Match any alphanumeric char in locale
-ALNUMU ALNUM, no 0 S ; Match any alphanumeric char using Unicode semantics
-ALNUMA ALNUM, no 0 S ; Match [A-Za-z_0-9]
-NALNUM NALNUM, no 0 S ; Match any non-alphanumeric character using native charset semantics for non-utf8
-NALNUML NALNUM, no 0 S ; Match any non-alphanumeric char in locale
-NALNUMU NALNUM, no 0 S ; Match any non-alphanumeric char using Unicode semantics
-NALNUMA NALNUM, no 0 S ; Match [^A-Za-z_0-9]
-SPACE SPACE, no 0 S ; Match any whitespace character using native charset semantics for non-utf8
-SPACEL SPACE, no 0 S ; Match any whitespace char in locale
-SPACEU SPACE, no 0 S ; Match any whitespace char using Unicode semantics
-SPACEA SPACE, no 0 S ; Match [ \t\n\f\r]
-NSPACE NSPACE, no 0 S ; Match any non-whitespace character using native charset semantics for non-utf8
-NSPACEL NSPACE, no 0 S ; Match any non-whitespace char in locale
-NSPACEU NSPACE, no 0 S ; Match any non-whitespace char using Unicode semantics
-NSPACEA NSPACE, no 0 S ; Match [^ \t\n\f\r]
-DIGIT DIGIT, no 0 S ; Match any numeric character using native charset semantics for non-utf8
-DIGITL DIGIT, no 0 S ; Match any numeric character in locale
-PLACEHOLDER1 NOTHING, no ; placeholder for missing DIGITU
-DIGITA DIGIT, no 0 S ; Match [0-9]
-NDIGIT NDIGIT, no 0 S ; Match any non-numeric character using native charset semantics for non-utf8
-NDIGITL NDIGIT, no 0 S ; Match any non-numeric character in locale
-PLACEHOLDER2 NOTHING, no ; placeholder for missing NDIGITU
-NDIGITA NDIGIT, no 0 S ; Match [^0-9]
-
-POSIXD POSIXD, none 0 S ; currently unused except as a placeholder
-POSIXL POSIXD, none 0 S ; currently unused except as a placeholder
-POSIXU POSIXD, none 0 S ; currently unused except as a placeholder
+# Order of the below is important. See ordering comment above.
+POSIXD POSIXD, none 0 S ; Some [[:class:]] under /d; the FLAGS field gives which one
+POSIXL POSIXD, none 0 S ; Some [[:class:]] under /l; the FLAGS field gives which one
+POSIXU POSIXD, none 0 S ; Some [[:class:]] under /u; the FLAGS field gives which one
 POSIXA POSIXD, none 0 S ; Some [[:class:]] under /a; the FLAGS field gives which one
-NPOSIXD NPOSIXD, none 0 S ; currently unused except as a placeholder
-NPOSIXL NPOSIXD, none 0 S ; currently unused except as a placeholder
-NPOSIXU NPOSIXD, none 0 S ; currently unused except as a placeholder
+NPOSIXD NPOSIXD, none 0 S ; complement of POSIXD, [[:^class:]]
+NPOSIXL NPOSIXD, none 0 S ; complement of POSIXL, [[:^class:]]
+NPOSIXU NPOSIXD, none 0 S ; complement of POSIXU, [[:^class:]]
 NPOSIXA NPOSIXD, none 0 S ; complement of POSIXA, [[:^class:]]
-# End of order is important (within groups)
+# End of order is important
 
 CLUMP CLUMP, no 0 V ; Match any extended grapheme cluster sequence
 
@@ -237,13 +208,6 @@ KEEPS KEEPS, no ; $& begins here.
 #*New charclass like patterns
 LNBREAK LNBREAK, none ; generic newline pattern
 
-# regcomp.c expects the node number of the complement to be one greater than
-# the non-complement
-VERTWS VERTWS, none 0 S ; vertical whitespace (Perl 6)
-NVERTWS NVERTWS, none 0 S ; not vertical whitespace (Perl 6)
-HORIZWS HORIZWS, none 0 S ; horizontal whitespace (Perl 6)
-NHORIZWS NHORIZWS, none 0 S ; not horizontal whitespace (Perl 6)
-
 # NEW STUFF SOMEWHERE ABOVE THIS LINE
 
 ################################################################################
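
The ordering comment in regcomp.sym says the op is computed as an offset from the /d version, and the new entries say the FLAGS field names the class. A minimal C sketch of that idea follows. The enum values, the `demo_node` struct, and `demo_make_node` are hypothetical and cover only the four POSIX ops and their complements; /aa is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical modifier numbering, in the order /d, /l, /u, /a. */
enum demo_charset { DEMO_CS_D, DEMO_CS_L, DEMO_CS_U, DEMO_CS_A };

/* Hypothetical op numbering that mirrors the order in regcomp.sym:
 * the four ops first, then their complements in the same order. */
enum demo_op {
    DEMO_POSIXD, DEMO_POSIXL, DEMO_POSIXU, DEMO_POSIXA,
    DEMO_NPOSIXD, DEMO_NPOSIXL, DEMO_NPOSIXU, DEMO_NPOSIXA
};

/* Simplified node: the op type plus a flags byte naming the class. */
struct demo_node {
    unsigned char type;
    unsigned char flags;
};

static struct demo_node demo_make_node(enum demo_charset cs, bool complement,
                                       unsigned char class_num)
{
    struct demo_node n;
    int base = complement ? DEMO_NPOSIXD : DEMO_POSIXD;

    assert(cs <= DEMO_CS_A);
    /* Because the ops are declared in modifier order, the modifier's
     * enum value is simply added to the /d op. */
    n.type = (unsigned char)(base + (int)cs);
    n.flags = class_num;   /* which class, e.g. a digit or word class number */
    return n;
}
```

This only works while the declaration order is kept, which is why the comments in regcomp.sym insist that the order is important.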