regcomp.c - rename NEXTOPER to REGNODE_AFTER and related logic

It is really easy to get confused about the difference between NEXTOPER() and regnext() of a regnode. The two concepts are related, similar, but importantly distinct. NEXTOPER() is also defined in such a way that it is easy to abuse and misunderstand and encourages producing code that is fragile to larger change, effectively "baking in" assumptions to the code that are difficult to discover by searching. Changing the type and storage requirements of a regnode may break things in subtle and hard to debug ways. An example of how NEXTOPER() is problematic is that this: NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the branch node", it means "jump forward by a regnode which happens to be two regnodes large". In other words NEXTOPER is just a fancy way of writing "node+1". This patch replaces NEXTOPER() with three new macros: REGNODE_AFTER_dynamic(node) REGNODE_AFTER_opcode(node,op) REGNODE_AFTER_type(node,tregnode_OPNAME) The first is the most generic case, it jumps forward by the size of the node, and determines that size by consulting OP(node). The second is where you have already extracted OP(node), and the third is where you know the actual structure that you want to jump forward by. Every regnode type has a corresponding type, which is known at compile time, so using the third will produce the most efficient code. However in many cases the code operates on one of several types, whose size may be the same now, but may change in the future, in which case one of the other forms is preferred. The run time logic in regexec.c should probably only use the REGNODE_AFTER_type() interface. Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(), which is used in a specific piece of legacy logic but should not be used otherwise. It is not safe to go backwards from an arbitrary node, we simply have no way to know how large the previous node is and thus where it starts. This patch includes some logic that validates assumptions during DEBUG mode which should catch errors from resizing regnodes. After this patch changing the size of an existing regnode should be relatively safe and errors related to sizing should trigger assertion fails. This patch includes changes to perlreguts.pod to explain this stuff better.
author: Yves Orton <demerphq@gmail.com> 2022-07-16 12:20:00 +0200
committer: Yves Orton <demerphq@gmail.com> 2022-08-03 11:07:09 +0200
commit: 0e48b698ea391da32a38a514b2361250e0bc2201 (patch)
tree: 0d1f2ab21ed3b2e4fa8f4ee6fffe04c0b6c1df75 /pod/perlreguts.pod
parent: ec5e6b1346dbbfc24682d87768357653663ef1eb (diff)
download: perl-0e48b698ea391da32a38a514b2361250e0bc2201.tar.gz
1 files changed, 100 insertions, 27 deletions
diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod
index 1f4bb02cdc..90147bd07e 100644
--- a/pod/perlreguts.pod
+++ b/pod/perlreguts.pod
@@ -216,39 +216,112 @@ types.
 
 =head3 What regop is next?
 
-There are three distinct concepts of "next" in the regex engine, and
-it is important to keep them clear.
+There are two distinct concepts of "next regnode" in the regex engine,
+and it is important to keep them distinct in your thinking as they
+overlap conceptually in many places, but where they don't overlap the
+difference is critical. For the majority of regnode types the two
+concepts are (nearly) identical in practice. The two types are
+C<REGNODE_AFTER> which is used heavily during compilation but only
+occasionally during execution and C<regnext> which is used heavily
+during execution, and only occasionally during compilation.
 
 =over 4
 
-=item *
-
-There is the "next regnode" from a given regnode, a value which is
-rarely useful except that sometimes it matches up in terms of value
-with one of the others, and that sometimes the code assumes this to
-always be so.
-
-=item *
-
-There is the "next regop" from a given regop/regnode. This is the
-regop physically located after the current one, as determined by
-the size of the current regop. This is often useful, such as when
-dumping the structure we use this order to traverse. Sometimes the code
-assumes that the "next regnode" is the same as the "next regop", or in
-other words assumes that the sizeof a given regop type is always going
-to be one regnode large.
-
-=item *
-
-There is the "regnext" from a given regop. This is the regop which
-is reached by jumping forward by the value of C<NEXT_OFF()>,
-or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1>
-structure. The subroutine C<regnext()> handles this transparently.
-This is the logical successor of the node, which in some cases, like
-that of the C<BRANCH> regop, has special meaning.
+=item "REGNODE_AFTER"
+
+This is the "positionally next regnode" in the compiled regex program.
+For the smaller regnode types it is C<regnode_ptr+1> under the hood, but
+as regnode sizes vary and can change over time we offer macros which
+hide the gory details.
+
+It is heavily used in the compiler phase but is only used by a few
+select regnode types in the execution phase. It is also heavily used in
+the code for dumping the regexp program for debugging.
+
+There are a selection of macros which can be used to compute this as
+efficiently as possible depending on the circumstances, for instance if
+the code is dedicated to handling a specific type of regnode you can use
+C<REGNODE_AFTER_type()> to produce the most efficient code. If you have
+already extracted the nodes opcode then you can use
+C<REGNODE_AFTER_opcode()>, and if you just have a pointer to a regnode you
+can use C<REGNODE_AFTER_dynamic()>.
+
+In older versions of the regex engine C<REGNODE_AFTER()> was called
+C<NEXTOPER> but this was found to be confusing and it was renamed. There
+is also a C<REGNODE_BEFORE()>, but it is unsafe and should not be used
+in new code.
+
+=item "regnext"
+
+This is the regnode which can be reached by jumping forward by the value
+of the C<NEXT_OFF()> member of the regnode, or in a few cases for longer
+jumps by the C<arg1> field of the C<regnode_1> structure. The subroutine
+C<regnext()> handles this transparently. In the majority of cases the
+C<regnext> for a regnode is the regnode which should be executed after the
+current one has successfully matched, but in some cases this may not be
+true. In loop control and branch control regnode types the regnext may
+signify something special, for BRANCH nodes C<regnext> is the
+next BRANCH that should be executed if the current one fails execution,
+and some loop control regnodes set the regnext to be the end of the loop
+so they can jump to their cleanup if the current iteration fails to match.
 
 =back
 
+Most regnode types do not create a branch in the execution flow, and
+leaving aside optimizations the two concepts of "next" are the same.
+For instance the C<regnext> and C<REGNODE_AFTER> of a SBOL opcode are
+the same during compilation phase. The main place this is not true is
+C<BRANCH> regnodes where the C<REGNODE_AFTER> represents the start of
+the pattern in the branch and the C<regnext> represents the linkage to
+the next BRANCH should this one fail to match, or 0 if it is the last
+branch. The looping logic for quantifiers also makes similar use of
+the distinction between the two types, with C<REGNODE_AFTER> being the
+inside of the loop construct, and the C<regnext> pointing at the end
+of the loop.
+
+During compilation the engine may not know what the regnext is for a
+given node, so during compilation C<regnext> is only used where it must
+be used and is known to be correct. At the very end of the compilation
+phase we walk the regex program and correct the regnext data as
+appropriate, and also perform various optimizations which may result in
+regnodes that were required during construction becoming redundant, or
+we may replace a large regnode with a much smaller one and filling in the
+gap with OPTIMIZED regnodes. Thus we might start with something like
+this:
+
+    BRANCH
+      EXACT "foo"
+    BRANCH
+      EXACT "bar"
+    EXACT "!"
+
+and replace it with something like:
+
+    TRIE foo|bar
+    OPTIMIZED
+    OPTIMIZED
+    OPTIMIZED
+    EXACT "!"
+
+the C<REGNODE_AFTER> for the C<TRIE> node would be an C<OPTIMIZED>
+regnode, and in theory the C<regnext> would be the same as the
+C<REGNODE_AFTER>. But it would be inefficient to execute the OPTIMIZED
+regnode as a noop three times, so the optimizer fixes the C<regnext> so
+such nodes are skipped during execution phase.
+
+During execution phases we use the C<regnext()> almost exclusively, and
+only use C<REGNODE_AFTER> in special cases where it has a well defined
+meaning for a given regnode type. For instance /x+/ results in
+
+    PLUS
+        EXACT "x"
+    END
+
+the C<regnext> of the C<PLUS> regnode is the C<END> regnode, and the
+C<REGNODE_AFTER> of the C<PLUS> regnode is the C<EXACT> regnode. The
+C<regnext> and C<REGNODE_AFTER> of the C<EXACT> regnode is the
+C<END> regnode.
+
 =head1 Process Overview
 
 Broadly speaking, performing a match of a string against a pattern
author	Yves Orton <demerphq@gmail.com>	2022-07-16 12:20:00 +0200
committer	Yves Orton <demerphq@gmail.com>	2022-08-03 11:07:09 +0200
commit	0e48b698ea391da32a38a514b2361250e0bc2201 (patch)
tree	0d1f2ab21ed3b2e4fa8f4ee6fffe04c0b6c1df75 /pod/perlreguts.pod
parent	ec5e6b1346dbbfc24682d87768357653663ef1eb (diff)
download	perl-0e48b698ea391da32a38a514b2361250e0bc2201.tar.gz