summaryrefslogtreecommitdiff
path: root/srclib/pcre/doc/Tech.Notes
diff options
context:
space:
mode:
Diffstat (limited to 'srclib/pcre/doc/Tech.Notes')
-rw-r--r--srclib/pcre/doc/Tech.Notes27
1 files changed, 19 insertions, 8 deletions
diff --git a/srclib/pcre/doc/Tech.Notes b/srclib/pcre/doc/Tech.Notes
index 03904db3cc..f5ca280115 100644
--- a/srclib/pcre/doc/Tech.Notes
+++ b/srclib/pcre/doc/Tech.Notes
@@ -135,7 +135,7 @@ end of each byte.
Back references
---------------
-OP_REF is followed by a single byte containing the reference number.
+OP_REF is followed by two bytes containing the reference number.
Repeating character classes and back references
@@ -163,11 +163,21 @@ Brackets and alternation
A pair of non-capturing (round) brackets is wrapped round each expression at
compile time, so alternation always happens in the context of brackets.
+
Non-capturing brackets use the opcode OP_BRA, while capturing brackets use
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English
speakers, including myself, can be round, square, curly, or pointy. Hence this
usage.]
+Originally PCRE was limited to 99 capturing brackets (so as not to use up all
+the opcodes). From release 3.5, there is no limit. What happens is that the
+first ones, up to EXTRACT_BASIC_MAX are handled with separate opcodes, as
+above. If there are more, the opcode is set to EXTRACT_BASIC_MAX+1, and the
+first operation in the bracket is OP_BRANUMBER, followed by a 2-byte bracket
+number. This opcode is ignored while matching, but is fished out when handling
+the bracket itself. (They could have all been done like this, but I was making
+minimal changes.)
+
A bracket opcode is followed by two bytes which give the offset to the next
alternative OP_ALT or, if there aren't any branches, to the matching KET
opcode. Each OP_ALT is followed by two bytes giving the offset to the next one,
@@ -191,8 +201,8 @@ appropriate.
A subpattern with a bounded maximum repetition is replicated in a nested
fashion up to the maximum number of times, with BRAZERO or BRAMINZERO before
each replication after the minimum, so that, for example, (abc){2,5} is
-compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 200-bracket limit does not
-apply to these internally generated brackets.
+compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 99 and 200 bracket limits do
+not apply to these internally generated brackets.
Assertions
@@ -202,9 +212,10 @@ Forward assertions are just like other subpatterns, but starting with one of
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a two byte count of the number of characters to move
-back the pointer in the subject string. A separate count is present in each
-alternative of a lookbehind assertion, allowing them to have different fixed
-lengths.
+back the pointer in the subject string. When operating in UTF-8 mode, the count
+is a character count rather than a byte count. A separate count is present in
+each alternative of a lookbehind assertion, allowing them to have different
+fixed lengths.
Once-only subpatterns
@@ -219,7 +230,7 @@ Conditional subpatterns
These are like other subpatterns, but they start with the opcode OP_COND. If
the condition is a back reference, this is stored at the start of the
-subpattern using the opcode OP_CREF followed by one byte containing the
+subpattern using the opcode OP_CREF followed by two bytes containing the
reference number. Otherwise, a conditional subpattern will always start with
one of the assertions.
@@ -239,4 +250,4 @@ the compiled data.
Philip Hazel
-February 2000
+August 2001