Fix error in documentation.

git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@580 6239d852-aaf2-0410-a92c-79f79f948069
author: ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2016-10-28 16:08:44 +0000
committer: ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2016-10-28 16:08:44 +0000
commit: 41d9ea3a465d5b3fd1f513e2b5565fe940b148d0 (patch)
tree: 6558a41f69c51ff272132882c30cac6027e3d79a /HACKING
parent: 67512207d7c4ca0485c7646e54dc67b09422e5a3 (diff)
download: pcre2-41d9ea3a465d5b3fd1f513e2b5565fe940b148d0.tar.gz
1 files changed, 77 insertions, 69 deletions
diff --git a/HACKING b/HACKING
index 53a0080..c51da7b 100644
--- a/HACKING
+++ b/HACKING
@@ -7,7 +7,7 @@ but with a revised (and incompatible) API. To avoid confusion, the original
 library is referred to as PCRE1 below. For information about testing PCRE2, see
 the pcre2test documentation and the comment at the head of the RunTest file.
 
-PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix 
+PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
 releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
 confusion with PCRE1.
 
@@ -124,39 +124,39 @@ compile) has full knowledge of group names and numbers throughout. Several
 dozen lines of messy code were eliminated, though the new pre-pass was not
 short. In particular, parsing and skipping over [] classes is complicated.
 
-While working on 10.22 I realized that I could simplify yet again by moving 
+While working on 10.22 I realized that I could simplify yet again by moving
 more of the parsing into the pre-pass, thus avoiding doing it in two places, so
 after 10.22 was released, the code underwent yet another big refactoring. This
 is how it is from 10.23 onwards:
 
-The function called parse_regex() scans the pattern characters, parsing them 
-into literal data and meta characters. It converts escapes such as \x{123} 
-into literals, handles \Q...\E, and skips over comments and non-significant 
-white space. The result of the scanning is put into a vector of 32-bit unsigned 
-integers. Values less than 0x80000000 are literal data. Higher values represent 
+The function called parse_regex() scans the pattern characters, parsing them
+into literal data and meta characters. It converts escapes such as \x{123}
+into literals, handles \Q...\E, and skips over comments and non-significant
+white space. The result of the scanning is put into a vector of 32-bit unsigned
+integers. Values less than 0x80000000 are literal data. Higher values represent
 meta-characters. The top 16-bits of such values identify the meta-character,
 and these are given names such as META_CAPTURE. The lower 16-bits are available
-for data, for example, the capturing group number. The only situation in which 
-literal data values greater than 0x7fffffff can appear is when the 32-bit 
-library is running in non-UTF mode. This is handled by having a special 
+for data, for example, the capturing group number. The only situation in which
+literal data values greater than 0x7fffffff can appear is when the 32-bit
+library is running in non-UTF mode. This is handled by having a special
 meta-character that is followed by the 32-bit data value.
 
 The size of the parsed pattern vector, when auto-callouts are not enabled, is
-bounded by the length of the pattern (with one exception). The code is written 
-so that each item in the pattern uses no more vector elements than the number 
+bounded by the length of the pattern (with one exception). The code is written
+so that each item in the pattern uses no more vector elements than the number
 of code units in the item itself. The exception is the aforementioned large
 32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
 advance to check for such values. When auto-callouts are enabled, the generous
 assumption is made that there will be a callout for each pattern code unit
-(which of course is only actually true if all code units are literals) plus one 
+(which of course is only actually true if all code units are literals) plus one
 at the end. There is a default parsed pattern vector on the stack, but if this
 is not big enough, heap memory is used.
 
-As before, the actual compiling function is run twice, the first time to 
-determine the amount of memory needed for the final compiled pattern. It 
+As before, the actual compiling function is run twice, the first time to
+determine the amount of memory needed for the final compiled pattern. It
 now processes the parsed pattern vector, not the pattern itself, although some
 of the parsed items refer to strings in the pattern - for example, group
-names. As escapes and comments have already been processed, the code is a bit 
+names. As escapes and comments have already been processed, the code is a bit
 simpler than before.
 
 Most errors can be diagnosed during the parsing scan. For those that cannot
@@ -168,64 +168,67 @@ identify where errors occur.
 The elements of the parsed pattern vector
 -----------------------------------------
 
-The word "offset" below means a code unit offset into the pattern. When 
+The word "offset" below means a code unit offset into the pattern. When
 PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
 stored in a single parsed pattern element. Otherwise (typically on 64-bit
 systems) it occupies two elements. The following meta items occupy just one
 element, with no data:
 
 META_ACCEPT           (*ACCEPT)
-META_ALT              | alternation 
-META_ASTERISK         *  
-META_ASTERISK_PLUS    *+ 
-META_ASTERISK_QUERY   *? 
-META_ATOMIC           (?> start of atomic group 
-META_CIRCUMFLEX       ^ metacharacter 
-META_CLASS            [ start of non-empty class 
-META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS 
+META_ASTERISK         *
+META_ASTERISK_PLUS    *+
+META_ASTERISK_QUERY   *?
+META_ATOMIC           (?> start of atomic group
+META_CIRCUMFLEX       ^ metacharacter
+META_CLASS            [ start of non-empty class
+META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
 META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
-META_CLASS_END        ] end of non-empty class 
-META_CLASS_NOT        [^ start non-empty negative class 
+META_CLASS_END        ] end of non-empty class
+META_CLASS_NOT        [^ start non-empty negative class
 META_COMMIT           (*COMMIT)
-META_DOLLAR           $ metacharacter 
-META_DOT              . metacharacter 
+META_DOLLAR           $ metacharacter
+META_DOT              . metacharacter
 META_END              End of pattern (this value is 0x80000000)
 META_FAIL             (*FAIL)
-META_KET              ) closing parenthesis 
+META_KET              ) closing parenthesis
 META_LOOKAHEAD        (?= start of lookahead
 META_LOOKAHEADNOT     (?! start of negative lookahead
-META_NOCAPTURE        (?: no capture parens 
-META_PLUS             +  
-META_PLUS_PLUS        ++ 
-META_PLUS_QUERY       +? 
+META_NOCAPTURE        (?: no capture parens
+META_PLUS             +
+META_PLUS_PLUS        ++
+META_PLUS_QUERY       +?
 META_PRUNE            (*PRUNE) - no argument
-META_QUERY            ?  
-META_QUERY_PLUS       ?+ 
-META_QUERY_QUERY      ?? 
-META_RANGE_ESCAPED    hyphen in class range with at least one escape 
-META_RANGE_LITERAL    hyphen in class range defined literally 
+META_QUERY            ?
+META_QUERY_PLUS       ?+
+META_QUERY_QUERY      ??
+META_RANGE_ESCAPED    hyphen in class range with at least one escape
+META_RANGE_LITERAL    hyphen in class range defined literally
 META_SKIP             (*SKIP) - no argument
 META_THEN             (*THEN) - no argument
 
-The two RANGE values occur only in character classes. They are positioned 
-between two literals that define the start and end of the range. In an EBCDIC 
-evironment it is necessary to know whether either of the range values was 
-specified as an escape. In an ASCII/Unicode environment the distinction is not 
+The two RANGE values occur only in character classes. They are positioned
+between two literals that define the start and end of the range. In an EBCDIC
+evironment it is necessary to know whether either of the range values was
+specified as an escape. In an ASCII/Unicode environment the distinction is not
 relevant.
 
-The following have data in the lower 16 bits, and may be followed by other data 
+The following have data in the lower 16 bits, and may be followed by other data
 elements:
 
+META_ALT              | alternation
 META_BACKREF
 META_CAPTURE
 META_ESCAPE
 META_RECURSE
 
+If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
+is the length of its branch, for which OP_REVERSE must be generated.
+
 META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
 their data in the lower 16 bits of the element.
 
 META_BACKREF is followed by an offset if the back reference group number is 10
-or more. The offsets of the first ocurrences of references to groups whose 
+or more. The offsets of the first ocurrences of references to groups whose
 numbers are less than 10 are put in cb->small_ref_offset[] (only the first
 occurrence is useful). On 64-bit systems this avoids using more than two parsed
 pattern elements for items such as \3. The offset is used when an error is
@@ -241,7 +244,7 @@ and an offset into the pattern to specify the name.
 
 The following have one data item that follows in the next vector element:
 
-META_BIGVALUE         Next is a literal >= META_END 
+META_BIGVALUE         Next is a literal >= META_END
 META_OPTIONS          (?i) and friends (data is new option bits)
 META_POSIX            POSIX class item (data identifies the class)
 META_POSIX_NEG        negative POSIX class item (ditto)
@@ -249,19 +252,19 @@ META_POSIX_NEG        negative POSIX class item (ditto)
 The following are followed by a length element, then a number of character code
 values (which should match with the length):
 
-META_MARK             (*MARK:xxxx) 
+META_MARK             (*MARK:xxxx)
 META_PRUNE_ARG        (*PRUNE:xxx)
 META_SKIP_ARG         (*SKIP:xxxx)
 META_THEN_ARG         (*THEN:xxxx)
 
-The following are followed by a length element, then an offset in the pattern 
+The following are followed by a length element, then an offset in the pattern
 that identifies the name:
 
-META_COND_NAME        (?(<name>) or (?('name') or (?(name) 
-META_COND_RNAME       (?(R&name) 
+META_COND_NAME        (?(<name>) or (?('name') or (?(name)
+META_COND_RNAME       (?(R&name)
 META_COND_RNUMBER     (?(Rdigits)
-META_RECURSE_BYNAME   (?&name) 
-META_BACKREF_BYNAME   \k'name' 
+META_RECURSE_BYNAME   (?&name)
+META_BACKREF_BYNAME   \k'name'
 
 META_COND_RNUMBER is used for names that start with R and continue with digits,
 because this is an ambiguous case. It could be a back reference to a group with
@@ -269,26 +272,31 @@ that name, or it could be a recursion test on a numbered group.
 
 This one is followed by an offset, for use in error messages, then a number:
 
-META_COND_NUMBER       (?([+-]digits) 
+META_COND_NUMBER       (?([+-]digits)
 
 The following are followed just by an offset, for use in error messages:
 
 META_COND_ASSERT      (?(?assertion)
 META_COND_DEFINE      (?(DEFINE)
-META_LOOKBEHIND       (?<= 
-META_LOOKBEHINDNOT    (?<!
 
-In fact, META_COND_ASSERT is used for any group starting (?( that does not 
-match any of the other META_COND cases. The check that this group is an 
-assertion (optionally preceded by a callout) happens at compile time. 
+In fact, META_COND_ASSERT is used for any group starting (?( that does not
+match any of the other META_COND cases. The check that this group is an
+assertion (optionally preceded by a callout) happens at compile time.
+
+The following are also followed just by an offset, but also the lower 16 bits
+of the main word contain the length of the first branch of the lookbehind
+group; this is used when generating OP_REVERSE for that branch.
+
+META_LOOKBEHIND       (?<=
+META_LOOKBEHINDNOT    (?<!
 
 The following are followed by two values, the minimum and maximum. Repeat
 values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
 represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
 
-META_MINMAX           {n,m}  repeat 
-META_MINMAX_PLUS      {n,m}+ repeat 
-META_MINMAX_QUERY     {n,m}? repeat 
+META_MINMAX           {n,m}  repeat
+META_MINMAX_PLUS      {n,m}+ repeat
+META_MINMAX_QUERY     {n,m}? repeat
 
 This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
 the next two are the major and minor numbers:
@@ -297,11 +305,11 @@ META_COND_VERSION     (?(VERSION<op>x.y)
 
 Callouts are converted into one of two items:
 
-META_CALLOUT_NUMBER   (?C with numerical argument 
-META_CALLOUT_STRING   (?C with string argument 
+META_CALLOUT_NUMBER   (?C with numerical argument
+META_CALLOUT_STRING   (?C with string argument
 
-In both cases, the next two elements contain the offset and length of the next 
-item in the pattern. Then there is either one callout number, or a length and 
+In both cases, the next two elements contain the offset and length of the next
+item in the pattern. Then there is either one callout number, or a length and
 an offset for the string argument. The length includes both delimiters.
 
 
@@ -410,11 +418,11 @@ These items are all just one unit long
   OP_THEN                )
 
 OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
-This ends the assertion, not the entire pattern match. The assertion (?!) is 
+This ends the assertion, not the entire pattern match. The assertion (?!) is
 always optimized to OP_FAIL.
 
 OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
-non-UTF modes and in UTF-32 mode (since one code unit still equals one 
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
 character). Another use is for [^] when empty classes are permitted
 (PCRE2_ALLOW_EMPTY_CLASS is set).
 
@@ -735,8 +743,8 @@ immediately before the assertion. It is also possible to insert a manual
 callout at this point. Only assertion conditions may have callouts preceding
 the condition.
 
-A condition that is the negative assertion (?!) is optimized to OP_FAIL in all 
-parts of the pattern, so this is another opcode that may appear as a condition. 
+A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
+parts of the pattern, so this is another opcode that may appear as a condition.
 It is treated the same as OP_FALSE.
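
As a reading aid for the parsed pattern vector layout described in the HACKING
text of this patch (values below 0x80000000 are literals, the top 16 bits of
higher values identify the meta-character, the lower 16 bits carry data), here
is a minimal sketch in C. It is not taken from the PCRE2 sources. Only the
value of META_END (0x80000000) comes from the text; every name prefixed with
ex_ or EX_ is hypothetical, and the numeric value given to EX_META_CAPTURE is
an assumption chosen purely for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Value stated in the HACKING text: META_END is 0x80000000. */
#define EX_META_END      0x80000000u
/* Hypothetical value, for illustration only; not the real META_CAPTURE. */
#define EX_META_CAPTURE  0x80080000u

/* An element is a meta item if it is not below 0x80000000. */
static int ex_is_meta(uint32_t element)
{
  return element >= 0x80000000u;
}

/* The top 16 bits identify which meta-character this is. */
static uint32_t ex_meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* The lower 16 bits hold data, e.g. a capturing group number. */
static uint32_t ex_meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}

/* Walk a vector that is terminated by EX_META_END and count the literals.
   This simplified walk assumes every meta item occupies a single element,
   which is not true of the real vector (many items have following data). */
static size_t ex_count_literals(const uint32_t *parsed)
{
  size_t count = 0;
  for (; *parsed != EX_META_END; parsed++)
    if (!ex_is_meta(*parsed)) count++;
  return count;
}
```

Under these assumptions, the pattern a(b) could be pictured as the vector
{ 'a', EX_META_CAPTURE | 1, 'b', <ket>, EX_META_END }, where the capture
element carries group number 1 in its lower 16 bits.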