author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-10-10 16:24:11 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-10-10 16:24:11 +0000
commit a0ab6e7fdaff6d0b459689f4bc5c79792cfd85cd
tree 5f5c6f147d3d5d3c2c06d64cfe7d6ac413991c59
parent c9a0ce9d1827d3416a549f1fe8ba46f081ccfec2
Fix problem with (*THEN) not backing up far enough.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@550 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'HACKING')
-rw-r--r-- HACKING 34
1 files changed, 26 insertions, 8 deletions
diff --git a/HACKING b/HACKING
index 8c6a42d..ee09132 100644
--- a/HACKING
+++ b/HACKING
@@ -4,6 +4,7 @@ Technical Notes about PCRE
These are very rough technical notes that record potentially useful information
about PCRE internals.
+
Historical note 1
-----------------
@@ -22,6 +23,7 @@ the one matching the longest subset of the subject string was chosen. This did
not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.
+
Historical note 2
-----------------
@@ -34,6 +36,7 @@ maximizing (or, optionally, minimizing in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.
+
OK, here's the real stuff
-------------------------
@@ -44,6 +47,7 @@ in the pattern, to save on compiling time. However, because of the greater
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.
+
Computing the memory requirement: how it was
--------------------------------------------
@@ -54,6 +58,7 @@ idea was that this would turn out faster than the Henry Spencer code because
the first pass is degenerate and the second pass can just store stuff straight
into the vector, which it knows is big enough.
+
Computing the memory requirement: how it is
-------------------------------------------
@@ -75,6 +80,7 @@ runs more slowly than before (30% or more, depending on the pattern) because it
is doing a full analysis of the pattern. My hope was that this would not be a
big issue, and in the event, nobody has commented on it.
+
Traditional matching function
-----------------------------
@@ -84,6 +90,7 @@ and the way that Perl works. This is not surprising, since it is intended to be
as compatible with Perl as possible. This is the function most users of PCRE
will use most of the time.
+
Supplementary matching function
-------------------------------
@@ -119,7 +126,6 @@ quantifiers) are always just two bytes long.
A list of the opcodes follows:
-
Opcodes with no following data
------------------------------
@@ -151,12 +157,24 @@ These items are all just one byte long
OP_EXTUNI match an extended Unicode character
OP_ANYNL match any Unicode newline sequence
- OP_ACCEPT ) These are Perl 5.10's "backtracking
- OP_COMMIT ) control verbs". If OP_ACCEPT is inside
- OP_FAIL ) capturing parentheses, it may be preceded
- OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
- OP_SKIP ) number, indicating which parentheses must be
- OP_THEN ) closed.
+ OP_ACCEPT ) These are Perl 5.10's "backtracking control
+ OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
+ OP_FAIL ) parentheses, it may be preceded by one or more
+ OP_PRUNE ) OP_CLOSE, followed by a 2-byte number,
+ OP_SKIP ) indicating which parentheses must be closed.
+
+
+Backtracking control verbs with data
+------------------------------------
+
+OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
+start of the current branch.
+
+OP_MARK is followed by the mark name, preceded by a one-byte length, and
+followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
+the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
+two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
+offset value.
Repeating single characters
@@ -419,4 +437,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
-October 2009
+October 2010
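
Illustration of the "Backtracking control verbs with data" layout
-----------------------------------------------------------------

The section added by this patch describes how the verbs with data are laid out
in the compiled code. The following is a minimal sketch of a decoder for those
items, not code from PCRE itself. The opcode values, the LINK_SIZE of 2, the
function names, and the most-significant-byte-first order of the offset are
all assumptions made for illustration. It also assumes that the names after
OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG use the same length-prefixed,
zero-terminated form that the text gives for OP_MARK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical values: the real ones come from PCRE's internal opcode
   list and its build-time configuration. */
#define LINK_SIZE 2
enum { OP_MARK = 1, OP_PRUNE_ARG, OP_SKIP_ARG, OP_THEN, OP_THEN_ARG };

/* Read a LINK_SIZE-byte offset, assumed most significant byte first. */
static unsigned int get_link(const unsigned char *p)
{
  unsigned int v = 0;
  int i;
  for (i = 0; i < LINK_SIZE; i++) v = (v << 8) | p[i];
  return v;
}

/* Printable name for the opcodes that carry only a name. */
static const char *verb_name(int op)
{
  switch (op)
    {
    case OP_MARK:      return "MARK";
    case OP_PRUNE_ARG: return "PRUNE_ARG";
    case OP_SKIP_ARG:  return "SKIP_ARG";
    default:           return "?";
    }
}

/* Describe the verb item starting at code[0] in out, and return the item's
   total length in bytes, or 0 if the opcode is not handled here. */
static size_t describe_verb(const unsigned char *code, char *out,
  size_t outsize)
{
  const unsigned char *p;
  assert(code != NULL && out != NULL && outsize > 0);
  p = code + 1;                         /* first byte after the opcode */
  switch (code[0])
    {
    case OP_THEN:                       /* opcode, offset */
      snprintf(out, outsize, "THEN back=%u", get_link(p));
      return 1 + LINK_SIZE;

    case OP_THEN_ARG:                   /* opcode, offset, len, name, 0 */
      snprintf(out, outsize, "THEN_ARG back=%u name=%.*s", get_link(p),
        (int)p[LINK_SIZE], (const char *)(p + LINK_SIZE + 1));
      return 1 + LINK_SIZE + 1 + p[LINK_SIZE] + 1;

    case OP_MARK:                       /* opcode, len, name, 0 */
    case OP_PRUNE_ARG:
    case OP_SKIP_ARG:
      snprintf(out, outsize, "%s name=%.*s", verb_name(code[0]),
        (int)p[0], (const char *)(p + 1));
      return 1 + 1 + p[0] + 1;

    default:
      return 0;
    }
}
```

Given the bytes { OP_MARK, 3, 'a', 'b', 'c', 0 }, describe_verb() reports a
name of "abc" and a length of 6: one opcode byte, one length byte, three name
bytes, and the terminating zero. For OP_THEN_ARG the name starts LINK_SIZE
bytes later, after the offset back to the start of the current branch.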