diff options
author | Yves Orton <demerphq@gmail.com> | 2014-09-17 00:23:01 +0200 |
---|---|---|
committer | Yves Orton <demerphq@gmail.com> | 2014-09-17 04:47:34 +0200 |
commit | d3d47aac53402ea3d4836c60e3659dc927a9887c (patch) | |
tree | 90ce6aa3a324b6c2cf4c6b17b2b69f91c4239a7c /regcomp.sym | |
parent | e1919fe5716d96be44afc32406d9504bc70403de (diff) | |
download | perl-d3d47aac53402ea3d4836c60e3659dc927a9887c.tar.gz |
Eliminate the duplicative regops BOL and EOL
See also perl5porters thread titled: "Perl MBOLism in regex engine"
In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda)
the BOL regop was split into two behaviours MBOL and SBOL, with SBOL
and BOL behaving identically. Similarly the EOL regop was split into
two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.
This then resulted in various duplicative code related to flags and
case statements in various parts of the regex engine.
It appears that perhaps BOL and EOL were kept because they are the
type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl
to handle aliases for the type data so that SBOL/MBOL are of type
BOL, even though BOL == SBOL seems to cover that case without adding
to the confusion.
This means two regops, a regstate, and an internal regex flag can
be removed (and used for other things), and various logic relating
to them can be removed.
For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and
MBOL is /^/m. (I consider it a fail we have no way to say MBOL without
the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is
also a /\z/ which is EOS "end of string" with or without the /m).
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- | regcomp.sym | 34 |
1 files changed, 19 insertions, 15 deletions
diff --git a/regcomp.sym b/regcomp.sym index bea2a8e716..b285647086 100644 --- a/regcomp.sym +++ b/regcomp.sym @@ -24,15 +24,19 @@ END END, no ; End of program. SUCCEED END, no ; Return from a subroutine, basically. -#* Anchors: - -BOL BOL, no ; Match "" at beginning of line. -MBOL BOL, no ; Same, assuming multiline. -SBOL BOL, no ; Same, assuming singleline. -EOS EOL, no ; Match "" at end of string. -EOL EOL, no ; Match "" at end of line. -MEOL EOL, no ; Same, assuming multiline. -SEOL EOL, no ; Same, assuming singleline. +#* Line Start Anchors: +SBOL BOL, no ; Match "" at beginning of line: /^/, /\A/ +MBOL BOL, no ; Same, assuming multiline: /^/m + +#* Line End Anchors: +SEOL EOL, no ; Match "" at end of line: /$/ +MEOL EOL, no ; Same, assuming multiline: /$/m +EOS EOL, no ; Match "" at end of string: /\z/ + +#* Match Start Anchors: +GPOS GPOS, no ; Matches where last m//g left off. + +#* Word Boundary Opcodes: # The regops that have varieties that vary depending on the character set regex # modifiers have to ordered thusly: /d, /l, /u, /a, /aa. This is because code # in regcomp.c uses the enum value of the modifier as an offset from the /d @@ -47,15 +51,14 @@ NBOUND NBOUND, no ; Match "" at any word non-boundary using nati NBOUNDL NBOUND, no ; Match "" at any locale word non-boundary NBOUNDU NBOUND, no ; Match "" at any word non-boundary using Unicode rules NBOUNDA NBOUND, no ; Match "" at any word non-boundary using ASCII rules -GPOS GPOS, no ; Matches where last m//g left off. #* [Special] alternatives: - REG_ANY REG_ANY, no 0 S ; Match any one character (except newline). SANY REG_ANY, no 0 S ; Match any one character. CANY REG_ANY, no 0 S ; Match any one byte. ANYOF ANYOF, sv 0 S ; Match character in (or not in) this class, single char match only +#* POSIX Character Classes: # Order of the below is important. See ordering comment above. POSIXD POSIXD, none 0 S ; Some [[:class:]] under /d; the FLAGS field gives which one POSIXL POSIXD, none 0 S ; Some [[:class:]] under /l; the FLAGS field gives which one @@ -147,16 +150,17 @@ NREFFL REF, no-sv 1 V ; Match already matched string, folded in loc. NREFFU REF, num 1 V ; Match already matched string, folded using unicode rules for non-utf8 NREFFA REF, num 1 V ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII +#*Support for long RE +LONGJMP LONGJMP, off 1 . 1 ; Jump far away. +BRANCHJ BRANCHJ, off 1 V 1 ; BRANCH with long offset. + +#*Special Case Regops IFMATCH BRANCHJ, off 1 . 2 ; Succeeds if the following matches. UNLESSM BRANCHJ, off 1 . 2 ; Fails if the following matches. SUSPEND BRANCHJ, off 1 V 1 ; "Independent" sub-RE. IFTHEN BRANCHJ, off 1 V 1 ; Switch, should be preceded by switcher. GROUPP GROUPP, num 1 ; Whether the group matched. -#*Support for long RE - -LONGJMP LONGJMP, off 1 . 1 ; Jump far away. -BRANCHJ BRANCHJ, off 1 V 1 ; BRANCH with long offset. #*The heavy worker |