author ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2016-10-28 16:08:44 +0000
committer ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2016-10-28 16:08:44 +0000
commit 41d9ea3a465d5b3fd1f513e2b5565fe940b148d0
tree 6558a41f69c51ff272132882c30cac6027e3d79a /HACKING
parent 67512207d7c4ca0485c7646e54dc67b09422e5a3
Fix error in documentation.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@580 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'HACKING')
-rw-r--r-- HACKING 146
1 files changed, 77 insertions, 69 deletions
diff --git a/HACKING b/HACKING
index 53a0080..c51da7b 100644
--- a/HACKING
+++ b/HACKING
@@ -7,7 +7,7 @@ but with a revised (and incompatible) API. To avoid confusion, the original
library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.
-PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
+PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.
@@ -124,39 +124,39 @@ compile) has full knowledge of group names and numbers throughout. Several
dozen lines of messy code were eliminated, though the new pre-pass was not
short. In particular, parsing and skipping over [] classes is complicated.
-While working on 10.22 I realized that I could simplify yet again by moving
+While working on 10.22 I realized that I could simplify yet again by moving
more of the parsing into the pre-pass, thus avoiding doing it in two places, so
after 10.22 was released, the code underwent yet another big refactoring. This
is how it is from 10.23 onwards:
-The function called parse_regex() scans the pattern characters, parsing them
-into literal data and meta characters. It converts escapes such as \x{123}
-into literals, handles \Q...\E, and skips over comments and non-significant
-white space. The result of the scanning is put into a vector of 32-bit unsigned
-integers. Values less than 0x80000000 are literal data. Higher values represent
+The function called parse_regex() scans the pattern characters, parsing them
+into literal data and meta characters. It converts escapes such as \x{123}
+into literals, handles \Q...\E, and skips over comments and non-significant
+white space. The result of the scanning is put into a vector of 32-bit unsigned
+integers. Values less than 0x80000000 are literal data. Higher values represent
meta-characters. The top 16 bits of such values identify the meta-character,
and these are given names such as META_CAPTURE. The lower 16 bits are available
-for data, for example, the capturing group number. The only situation in which
-literal data values greater than 0x7fffffff can appear is when the 32-bit
-library is running in non-UTF mode. This is handled by having a special
+for data, for example, the capturing group number. The only situation in which
+literal data values greater than 0x7fffffff can appear is when the 32-bit
+library is running in non-UTF mode. This is handled by having a special
meta-character that is followed by the 32-bit data value.
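The element layout described above can be sketched in C roughly as follows. This is only an illustration: the helper names and the numeric value of DEMO_META_CAPTURE are assumptions made up for this example, not taken from the PCRE2 source. Only the 0x80000000 boundary and the split into two 16-bit halves come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Boundary taken from the text: values below this are literal data. */
#define DEMO_META_BASE    0x80000000u

/* Hypothetical meta code, chosen only for illustration. */
#define DEMO_META_CAPTURE 0x80080000u

/* True if the element is literal data rather than a meta-character. */
static int demo_is_literal(uint32_t element)
{
  return element < DEMO_META_BASE;
}

/* The top 16 bits identify the meta-character. */
static uint32_t demo_meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* The lower 16 bits carry data such as a capturing group number. */
static uint32_t demo_meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}

/* Combine a meta code with up to 16 bits of data. */
static uint32_t demo_make_meta(uint32_t code, uint32_t data)
{
  assert(code >= DEMO_META_BASE && (code & 0x0000ffffu) == 0);
  assert(data <= 0xffffu);
  return code | data;
}
```

With these helpers, demo_make_meta(DEMO_META_CAPTURE, 3) would stand for the opening parenthesis of capturing group 3, and any element for which demo_is_literal() is true can be treated directly as a character value.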
The size of the parsed pattern vector, when auto-callouts are not enabled, is
-bounded by the length of the pattern (with one exception). The code is written
-so that each item in the pattern uses no more vector elements than the number
+bounded by the length of the pattern (with one exception). The code is written
+so that each item in the pattern uses no more vector elements than the number
of code units in the item itself. The exception is the aforementioned large
32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
advance to check for such values. When auto-callouts are enabled, the generous
assumption is made that there will be a callout for each pattern code unit
-(which of course is only actually true if all code units are literals) plus one
+(which of course is only actually true if all code units are literals) plus one
at the end. There is a default parsed pattern vector on the stack, but if this
is not big enough, heap memory is used.
-As before, the actual compiling function is run twice, the first time to
-determine the amount of memory needed for the final compiled pattern. It
+As before, the actual compiling function is run twice, the first time to
+determine the amount of memory needed for the final compiled pattern. It
now processes the parsed pattern vector, not the pattern itself, although some
of the parsed items refer to strings in the pattern - for example, group
-names. As escapes and comments have already been processed, the code is a bit
+names. As escapes and comments have already been processed, the code is a bit
simpler than before.
Most errors can be diagnosed during the parsing scan. For those that cannot
@@ -168,64 +168,67 @@ identify where errors occur.
The elements of the parsed pattern vector
-----------------------------------------
-The word "offset" below means a code unit offset into the pattern. When
+The word "offset" below means a code unit offset into the pattern. When
PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
stored in a single parsed pattern element. Otherwise (typically on 64-bit
systems) it occupies two elements. The following meta items occupy just one
element, with no data:
META_ACCEPT (*ACCEPT)
-META_ALT | alternation
-META_ASTERISK *
-META_ASTERISK_PLUS *+
-META_ASTERISK_QUERY *?
-META_ATOMIC (?> start of atomic group
-META_CIRCUMFLEX ^ metacharacter
-META_CLASS [ start of non-empty class
-META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
+META_ASTERISK *
+META_ASTERISK_PLUS *+
+META_ASTERISK_QUERY *?
+META_ATOMIC (?> start of atomic group
+META_CIRCUMFLEX ^ metacharacter
+META_CLASS [ start of non-empty class
+META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
META_CLASS_EMPTY_NOT [^] negative empty class - ditto
-META_CLASS_END ] end of non-empty class
-META_CLASS_NOT [^ start non-empty negative class
+META_CLASS_END ] end of non-empty class
+META_CLASS_NOT [^ start non-empty negative class
META_COMMIT (*COMMIT)
-META_DOLLAR $ metacharacter
-META_DOT . metacharacter
+META_DOLLAR $ metacharacter
+META_DOT . metacharacter
META_END End of pattern (this value is 0x80000000)
META_FAIL (*FAIL)
-META_KET ) closing parenthesis
+META_KET ) closing parenthesis
META_LOOKAHEAD (?= start of lookahead
META_LOOKAHEADNOT (?! start of negative lookahead
-META_NOCAPTURE (?: no capture parens
-META_PLUS +
-META_PLUS_PLUS ++
-META_PLUS_QUERY +?
+META_NOCAPTURE (?: no capture parens
+META_PLUS +
+META_PLUS_PLUS ++
+META_PLUS_QUERY +?
META_PRUNE (*PRUNE) - no argument
-META_QUERY ?
-META_QUERY_PLUS ?+
-META_QUERY_QUERY ??
-META_RANGE_ESCAPED hyphen in class range with at least one escape
-META_RANGE_LITERAL hyphen in class range defined literally
+META_QUERY ?
+META_QUERY_PLUS ?+
+META_QUERY_QUERY ??
+META_RANGE_ESCAPED hyphen in class range with at least one escape
+META_RANGE_LITERAL hyphen in class range defined literally
META_SKIP (*SKIP) - no argument
META_THEN (*THEN) - no argument
-The two RANGE values occur only in character classes. They are positioned
-between two literals that define the start and end of the range. In an EBCDIC
-environment it is necessary to know whether either of the range values was
-specified as an escape. In an ASCII/Unicode environment the distinction is not
+The two RANGE values occur only in character classes. They are positioned
+between two literals that define the start and end of the range. In an EBCDIC
+environment it is necessary to know whether either of the range values was
+specified as an escape. In an ASCII/Unicode environment the distinction is not
relevant.
-The following have data in the lower 16 bits, and may be followed by other data
+The following have data in the lower 16 bits, and may be followed by other data
elements:
+META_ALT | alternation
META_BACKREF
META_CAPTURE
META_ESCAPE
META_RECURSE
+If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
+is the length of its branch, for which OP_REVERSE must be generated.
+
META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
their data in the lower 16 bits of the element.
META_BACKREF is followed by an offset if the back reference group number is 10
-or more. The offsets of the first occurrences of references to groups whose
+or more. The offsets of the first occurrences of references to groups whose
numbers are less than 10 are put in cb->small_ref_offset[] (only the first
occurrence is useful). On 64-bit systems this avoids using more than two parsed
pattern elements for items such as \3. The offset is used when an error is
@@ -241,7 +244,7 @@ and an offset into the pattern to specify the name.
The following have one data item that follows in the next vector element:
-META_BIGVALUE Next is a literal >= META_END
+META_BIGVALUE Next is a literal >= META_END
META_OPTIONS (?i) and friends (data is new option bits)
META_POSIX POSIX class item (data identifies the class)
META_POSIX_NEG negative POSIX class item (ditto)
@@ -249,19 +252,19 @@ META_POSIX_NEG negative POSIX class item (ditto)
The following are followed by a length element, then a number of character code
values (which should match with the length):
-META_MARK (*MARK:xxxx)
+META_MARK (*MARK:xxxx)
META_PRUNE_ARG (*PRUNE:xxx)
META_SKIP_ARG (*SKIP:xxxx)
META_THEN_ARG (*THEN:xxxx)
-The following are followed by a length element, then an offset in the pattern
+The following are followed by a length element, then an offset in the pattern
that identifies the name:
-META_COND_NAME (?(<name>) or (?('name') or (?(name)
-META_COND_RNAME (?(R&name)
+META_COND_NAME (?(<name>) or (?('name') or (?(name)
+META_COND_RNAME (?(R&name)
META_COND_RNUMBER (?(Rdigits)
-META_RECURSE_BYNAME (?&name)
-META_BACKREF_BYNAME \k'name'
+META_RECURSE_BYNAME (?&name)
+META_BACKREF_BYNAME \k'name'
META_COND_RNUMBER is used for names that start with R and continue with digits,
because this is an ambiguous case. It could be a back reference to a group with
@@ -269,26 +272,31 @@ that name, or it could be a recursion test on a numbered group.
This one is followed by an offset, for use in error messages, then a number:
-META_COND_NUMBER (?([+-]digits)
+META_COND_NUMBER (?([+-]digits)
The following are followed just by an offset, for use in error messages:
META_COND_ASSERT (?(?assertion)
META_COND_DEFINE (?(DEFINE)
-META_LOOKBEHIND (?<=
-META_LOOKBEHINDNOT (?<!
-In fact, META_COND_ASSERT is used for any group starting (?( that does not
-match any of the other META_COND cases. The check that this group is an
-assertion (optionally preceded by a callout) happens at compile time.
+In fact, META_COND_ASSERT is used for any group starting (?( that does not
+match any of the other META_COND cases. The check that this group is an
+assertion (optionally preceded by a callout) happens at compile time.
+
+The following are also followed just by an offset, but also the lower 16 bits
+of the main word contain the length of the first branch of the lookbehind
+group; this is used when generating OP_REVERSE for that branch.
+
+META_LOOKBEHIND (?<=
+META_LOOKBEHINDNOT (?<!
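As a rough illustration of how an offset can occupy one or two 32-bit elements, and how a lookbehind item can carry its first branch length in the lower 16 bits, consider the C sketch below. All names, the value of DEMO_META_LOOKBEHIND, and the choice to store the high half of a 64-bit offset first are assumptions for this example; the real source may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical meta code, chosen only for illustration. */
#define DEMO_META_LOOKBEHIND 0x80250000u

/* One element when size_t fits in uint32_t, otherwise two. */
#define DEMO_OFFSET_ELEMS (sizeof(size_t) > sizeof(uint32_t) ? 2 : 1)

/* Append an offset; returns a pointer just past what was written.
   Assumed layout: high half first when two elements are needed. */
static uint32_t *demo_put_offset(uint32_t *p, size_t offset)
{
  if (DEMO_OFFSET_ELEMS == 2)
    *p++ = (uint32_t)((uint64_t)offset >> 32);
  *p++ = (uint32_t)(offset & 0xffffffffu);
  return p;
}

/* Read an offset back; returns a pointer just past what was read. */
static const uint32_t *demo_get_offset(const uint32_t *p, size_t *offset)
{
  uint64_t value = 0;
  if (DEMO_OFFSET_ELEMS == 2)
    value = (uint64_t)(*p++) << 32;
  value |= *p++;
  *offset = (size_t)value;
  return p;
}

/* Emit a lookbehind item: the main word holds the length of the first
   branch in its lower 16 bits, and the pattern offset follows. */
static uint32_t *demo_put_lookbehind(uint32_t *p, uint32_t branch_length,
  size_t offset)
{
  assert(branch_length <= 0xffffu);
  *p++ = DEMO_META_LOOKBEHIND | branch_length;
  return demo_put_offset(p, offset);
}
```

A reader of the vector would mask the main word with 0xffff to recover the branch length, then call demo_get_offset() on the following element(s) to recover the position to report in an error message.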
The following are followed by two values, the minimum and maximum. Repeat
values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
-META_MINMAX {n,m} repeat
-META_MINMAX_PLUS {n,m}+ repeat
-META_MINMAX_QUERY {n,m}? repeat
+META_MINMAX {n,m} repeat
+META_MINMAX_PLUS {n,m}+ repeat
+META_MINMAX_QUERY {n,m}? repeat
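The repeat encoding can be pictured with a small C sketch. The names and the DEMO_META_MINMAX value are hypothetical, and DEMO_UNLIMITED_REPEAT is simply assumed to be one more than the limit, since the text says only that it is bigger than MAX_REPEAT. The range checks are part of the sketch, not a statement of where the real code performs them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_REPEAT       65535u                  /* limit stated in the text */
#define DEMO_UNLIMITED_REPEAT (DEMO_MAX_REPEAT + 1u)  /* assumed value above the limit */
#define DEMO_META_MINMAX      0x80300000u             /* hypothetical meta code */

/* Write a {min,max} repeat as three elements: the meta item, then the
   minimum, then the maximum. Returns the number of elements written,
   or 0 if the values are out of range or out of order. */
static int demo_put_minmax(uint32_t *p, uint32_t min, uint32_t max)
{
  assert(p != NULL);
  if (min > DEMO_MAX_REPEAT) return 0;
  if (max != DEMO_UNLIMITED_REPEAT &&
      (max > DEMO_MAX_REPEAT || max < min)) return 0;
  p[0] = DEMO_META_MINMAX;
  p[1] = min;
  p[2] = max;
  return 3;
}
```

Under these assumptions a pattern item such as {2,5} becomes the three elements DEMO_META_MINMAX, 2, 5, while {1,} stores DEMO_UNLIMITED_REPEAT as its maximum.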
This one is followed by three elements. The first is 0 for '=' and 1 for '>=';
the next two are the major and minor numbers:
@@ -297,11 +305,11 @@ META_COND_VERSION (?(VERSION<op>x.y)
Callouts are converted into one of two items:
-META_CALLOUT_NUMBER (?C with numerical argument
-META_CALLOUT_STRING (?C with string argument
+META_CALLOUT_NUMBER (?C with numerical argument
+META_CALLOUT_STRING (?C with string argument
-In both cases, the next two elements contain the offset and length of the next
-item in the pattern. Then there is either one callout number, or a length and
+In both cases, the next two elements contain the offset and length of the next
+item in the pattern. Then there is either one callout number, or a length and
an offset for the string argument. The length includes both delimiters.
@@ -410,11 +418,11 @@ These items are all just one unit long
OP_THEN )
OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
-This ends the assertion, not the entire pattern match. The assertion (?!) is
+This ends the assertion, not the entire pattern match. The assertion (?!) is
always optimized to OP_FAIL.
OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
-non-UTF modes and in UTF-32 mode (since one code unit still equals one
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
character). Another use is for [^] when empty classes are permitted
(PCRE2_ALLOW_EMPTY_CLASS is set).
@@ -735,8 +743,8 @@ immediately before the assertion. It is also possible to insert a manual
callout at this point. Only assertion conditions may have callouts preceding
the condition.
-A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
-parts of the pattern, so this is another opcode that may appear as a condition.
+A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
+parts of the pattern, so this is another opcode that may appear as a condition.
It is treated the same as OP_FALSE.