author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-12-28 16:10:09 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-12-28 16:10:09 +0000 |
commit | a29cc4dc66d82b59de7616c53517c58271e6e0e8 (patch) | |
tree | c74caa3f756e12f475c840392d507a89bcfe8bc8 /doc/html | |
parent | 77b62a421481e0547788d4c0dc7539ac7f41d85b (diff) | |
download | pcre-a29cc4dc66d82b59de7616c53517c58271e6e0e8.tar.gz |
Rolled back trunk to r755 to prepare for merging the 16-bit branch.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@835 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/html')
-rw-r--r-- | doc/html/pcreapi.html | 36 |
-rw-r--r-- | doc/html/pcrecallout.html | 9 |
-rw-r--r-- | doc/html/pcrecompat.html | 5 |
-rw-r--r-- | doc/html/pcrejit.html | 141 |
-rw-r--r-- | doc/html/pcrelimits.html | 8 |
-rw-r--r-- | doc/html/pcrematching.html | 8 |
-rw-r--r-- | doc/html/pcrepattern.html | 148 |
-rw-r--r-- | doc/html/pcretest.html | 7 |
8 files changed, 108 insertions, 254 deletions
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html index 3cbb6be..cd90766 100644 --- a/doc/html/pcreapi.html +++ b/doc/html/pcreapi.html @@ -649,23 +649,6 @@ character). Thus, the pattern AB]CD becomes illegal when this option is set. string (by default this causes the current matching alternative to fail). A pattern such as (\1)(a) succeeds when this option is set (assuming it can find an "a" in the subject), whereas it fails by default, for Perl compatibility. -</P> -<P> -(3) \U matches an upper case "U" character; by default \U causes a compile -time error (Perl uses \U to upper case subsequent characters). -</P> -<P> -(4) \u matches a lower case "u" character unless it is followed by four -hexadecimal digits, in which case the hexadecimal number defines the code point -to match. By default, \u causes a compile time error (Perl uses it to upper -case the following character). -</P> -<P> -(5) \x matches a lower case "x" character unless it is followed by two -hexadecimal digits, in which case the hexadecimal number defines the code point -to match. By default, as in Perl, a hexadecimal number is always expected after -\x, but it may have zero, one, or two digits (so, for example, \xz matches a -binary zero character followed by z). <pre> PCRE_MULTILINE </pre> @@ -1144,12 +1127,6 @@ particular pattern. See the <a href="pcrejit.html"><b>pcrejit</b></a> documentation for details of what can and cannot be handled. <pre> - PCRE_INFO_JITSIZE -</pre> -If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE option, -return the size of the JIT compiled code, otherwise return zero. The fourth -argument should point to a <b>size_t</b> variable. -<pre> PCRE_INFO_LASTLITERAL </pre> Return the value of the rightmost literal byte that must exist in any matched @@ -1258,13 +1235,10 @@ For such patterns, the PCRE_ANCHORED bit is set in the options returned by <pre> PCRE_INFO_SIZE </pre> -Return the size of the compiled pattern. 
The fourth argument should point to a -<b>size_t</b> variable. This value does not include the size of the <b>pcre</b> -structure that is returned by <b>pcre_compile()</b>. The value that is passed as -the argument to <b>pcre_malloc()</b> when <b>pcre_compile()</b> is getting memory -in which to place the compiled data is the value returned by this option plus -the size of the <b>pcre</b> structure. Studying a compiled pattern, with or -without JIT, does not alter the value returned by this option. +Return the size of the compiled pattern, that is, the value that was passed as +the argument to <b>pcre_malloc()</b> when PCRE was getting memory in which to +place the compiled data. The fourth argument should point to a <b>size_t</b> +variable. <pre> PCRE_INFO_STUDYSIZE </pre> @@ -2512,7 +2486,7 @@ Cambridge CB2 3QH, England. </P> <br><a name="SEC24" href="#TOC1">REVISION</a><br> <P> -Last updated: 02 December 2011 +Last updated: 23 September 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> diff --git a/doc/html/pcrecallout.html b/doc/html/pcrecallout.html index e891fdf..40d5fa2 100644 --- a/doc/html/pcrecallout.html +++ b/doc/html/pcrecallout.html @@ -189,10 +189,9 @@ same callout number. However, they are set for all callouts. <P> The <i>mark</i> field is present from version 2 of the <i>pcre_callout</i> structure. In callouts from <b>pcre_exec()</b> it contains a pointer to the -zero-terminated name of the most recently passed (*MARK), (*PRUNE), or (*THEN) -item in the match, or NULL if no such items have been passed. Instances of -(*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In -callouts from <b>pcre_dfa_exec()</b> this field always contains NULL. +zero-terminated name of the most recently passed (*MARK) item in the match, or +NULL if there are no (*MARK)s in the current matching path. In callouts from +<b>pcre_dfa_exec()</b> this field always contains NULL. 
</P> <br><a name="SEC4" href="#TOC1">RETURN VALUES</a><br> <P> @@ -220,7 +219,7 @@ Cambridge CB2 3QH, England. </P> <br><a name="SEC6" href="#TOC1">REVISION</a><br> <P> -Last updated: 30 November 2011 +Last updated: 26 August 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> diff --git a/doc/html/pcrecompat.html b/doc/html/pcrecompat.html index 4e5e18b..69d9d1d 100644 --- a/doc/html/pcrecompat.html +++ b/doc/html/pcrecompat.html @@ -53,8 +53,7 @@ represent a binary zero. own, matching a non-newline character, is supported.) In fact these are implemented by Perl's general string-handling and are not part of its pattern matching engine. If any of these are encountered by PCRE, an error is -generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set, -\U and \u are interpreted as JavaScript interprets them. +generated. </P> <P> 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is @@ -203,7 +202,7 @@ Cambridge CB2 3QH, England. REVISION </b><br> <P> -Last updated: 14 November 2011 +Last updated: 09 October 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> diff --git a/doc/html/pcrejit.html b/doc/html/pcrejit.html index 7411ecf..c257d0d 100644 --- a/doc/html/pcrejit.html +++ b/doc/html/pcrejit.html @@ -20,11 +20,10 @@ man page, in case the conversion went wrong. 
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT EXECUTION</a> <li><a name="TOC6" href="#SEC6">SAVING AND RESTORING COMPILED PATTERNS</a> <li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a> -<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a> -<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a> -<li><a name="TOC10" href="#SEC10">SEE ALSO</a> -<li><a name="TOC11" href="#SEC11">AUTHOR</a> -<li><a name="TOC12" href="#SEC12">REVISION</a> +<li><a name="TOC8" href="#SEC8">EXAMPLE CODE</a> +<li><a name="TOC9" href="#SEC9">SEE ALSO</a> +<li><a name="TOC10" href="#SEC10">AUTHOR</a> +<li><a name="TOC11" href="#SEC11">REVISION</a> </ul> <br><a name="SEC1" href="#TOC1">PCRE JUST-IN-TIME COMPILER SUPPORT</a><br> <P> @@ -58,17 +57,11 @@ fully tested. If --enable-jit is set on an unsupported platform, compilation fails. </P> <P> -A program that is linked with PCRE 8.20 or later can tell if JIT support is -available by calling <b>pcre_config()</b> with the PCRE_CONFIG_JIT option. The -result is 1 when JIT is available, and 0 otherwise. However, a simple program -does not need to check this in order to use JIT. The API is implemented in a -way that falls back to the ordinary PCRE code if JIT is not available. -</P> -<P> -If your program may sometimes be linked with versions of PCRE that are older -than 8.20, but you want to use JIT when it is available, you can test -the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT macro such -as PCRE_CONFIG_JIT, for compile-time control of your code. +A program can tell if JIT support is available by calling <b>pcre_config()</b> +with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available, and 0 +otherwise. However, a simple program does not need to check this in order to +use JIT. The API is implemented in a way that falls back to the ordinary PCRE +code if JIT is not available. 
</P> <br><a name="SEC3" href="#TOC1">SIMPLE USE OF JIT</a><br> <P> @@ -82,21 +75,6 @@ You have to do two things to make use of the JIT support in the simplest way: no longer needed instead of just freeing it yourself. This ensures that any JIT data is also freed. </pre> -For a program that may be linked with pre-8.20 versions of PCRE, you can insert -<pre> - #ifndef PCRE_STUDY_JIT_COMPILE - #define PCRE_STUDY_JIT_COMPILE 0 - #endif -</pre> -so that no option is passed to <b>pcre_study()</b>, and then use something like -this to free the study data: -<pre> - #ifdef PCRE_CONFIG_JIT - pcre_free_study(study_ptr); - #else - pcre_free(study_ptr); - #endif -</pre> In some circumstances you may need to call additional functions. These are described in the section entitled <a href="#stackcontrol">"Controlling the JIT stack"</a> @@ -138,8 +116,12 @@ supported. <P> The unsupported pattern items are: <pre> - \C match a single byte; not supported in UTF-8 mode + \C match a single byte; not supported in UTF-8 mode (?Cn) callouts + (?(<name>)... conditional test on setting of a named subpattern + (?(R)... conditional test on whole pattern recursion + (?(Rn)... conditional test on recursion, by number + (?(R&name)... conditional test on recursion, by name (*COMMIT) ) (*MARK) ) (*PRUNE) ) the backtracking control verbs @@ -185,10 +167,7 @@ When the compiled JIT code runs, it needs a block of memory to use as a stack. By default, it uses 32K on the machine stack. However, some large or complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack. Three functions are provided for -managing blocks of memory for use as JIT stacks. There is further discussion -about the use of JIT stacks in the section entitled -<a href="#stackcontrol">"JIT stack FAQ"</a> -below. +managing blocks of memory for use as JIT stacks. </P> <P> The <b>pcre_jit_stack_alloc()</b> function creates a JIT stack. 
Its arguments @@ -255,86 +234,8 @@ All the functions described in this section do nothing if JIT is not available, and <b>pcre_assign_jit_stack()</b> does nothing unless the <b>extra</b> argument is non-NULL and points to a <b>pcre_extra</b> block that is the result of a successful study with PCRE_STUDY_JIT_COMPILE. -<a name="stackfaq"></a></P> -<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br> -<P> -(1) Why do we need JIT stacks? -<br> -<br> -PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where -the local data of the current node is pushed before checking its child nodes. -Allocating real machine stack on some platforms is difficult. For example, the -stack chain needs to be updated every time if we extend the stack on PowerPC. -Although it is possible, its updating time overhead decreases performance. So -we do the recursion in memory. -</P> -<P> -(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>? -<br> -<br> -Modern operating systems have a nice feature: they can reserve an address space -instead of allocating memory. We can safely allocate memory pages inside this -address space, so the stack could grow without moving memory data (this is -important because of pointers). Thus we can allocate 1M address space, and use -only a single memory page (usually 4K) if that is enough. However, we can still -grow up to 1M anytime if needed. -</P> -<P> -(3) Who "owns" a JIT stack? -<br> -<br> -The owner of the stack is the user program, not the JIT studied pattern or -anything else. The user program must ensure that if a stack is used by -<b>pcre_exec()</b>, (that is, it is assigned to the pattern currently running), -that stack must not be used by any other threads (to avoid overwriting the same -memory area). The best practice for multithreaded programs is to allocate a -stack for each thread, and return this stack through the JIT callback function. -</P> -<P> -(4) When should a JIT stack be freed? 
-<br> -<br> -You can free a JIT stack at any time, as long as it will not be used by -<b>pcre_exec()</b> again. When you assign the stack to a pattern, only a pointer -is set. There is no reference counting or any other magic. You can free the -patterns and stacks in any order, anytime. Just <i>do not</i> call -<b>pcre_exec()</b> with a pattern pointing to an already freed stack, as that -will cause SEGFAULT. (Also, do not free a stack currently used by -<b>pcre_exec()</b> in another thread). You can also replace the stack for a -pattern at any time. You can even free the previous stack before assigning a -replacement. -</P> -<P> -(5) Should I allocate/free a stack every time before/after calling -<b>pcre_exec()</b>? -<br> -<br> -No, because this is too costly in terms of resources. However, you could -implement some clever idea which release the stack if it is not used in let's -say two minutes. The JIT callback can help to achive this without keeping a -list of the currently JIT studied patterns. -</P> -<P> -(6) OK, the stack is for long term memory allocation. But what happens if a -pattern causes stack overflow with a stack of 1M? Is that 1M kept until the -stack is freed? -<br> -<br> -Especially on embedded sytems, it might be a good idea to release -memory sometimes without freeing the stack. There is no API for this at the -moment. Probably a function call which returns with the currently allocated -memory for any stack and another which allows releasing memory (shrinking the -stack) would be a good idea if someone needs this. -</P> -<P> -(7) This is too much of a headache. Isn't there any better solution for JIT -stack handling? -<br> -<br> -No, thanks to Windows. If POSIX threads were used everywhere, we could throw -out this complicated API. </P> -<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br> +<br><a name="SEC8" href="#TOC1">EXAMPLE CODE</a><br> <P> This is a single-threaded example that specifies a JIT stack without using a callback. 
@@ -359,22 +260,22 @@ callback. </PRE> </P> -<br><a name="SEC10" href="#TOC1">SEE ALSO</a><br> +<br><a name="SEC9" href="#TOC1">SEE ALSO</a><br> <P> <b>pcreapi</b>(3) </P> -<br><a name="SEC11" href="#TOC1">AUTHOR</a><br> +<br><a name="SEC10" href="#TOC1">AUTHOR</a><br> <P> -Philip Hazel (FAQ by Zoltan Herczeg) +Philip Hazel <br> University Computing Service <br> Cambridge CB2 3QH, England. <br> </P> -<br><a name="SEC12" href="#TOC1">REVISION</a><br> +<br><a name="SEC11" href="#TOC1">REVISION</a><br> <P> -Last updated: 26 November 2011 +Last updated: 19 October 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> diff --git a/doc/html/pcrelimits.html b/doc/html/pcrelimits.html index 2cab81f..4dc28f7 100644 --- a/doc/html/pcrelimits.html +++ b/doc/html/pcrelimits.html @@ -37,12 +37,6 @@ There is no limit to the number of parenthesized subpatterns, but there can be no more than 65535 capturing subpatterns. </P> <P> -There is a limit to the number of forward references to subsequent subpatterns -of around 200,000. Repeated forward references with fixed upper limits, for -example, (?2){0,100} when subpattern number 2 is to the right, are included in -the count. There is no limit to the number of backward references. -</P> -<P> The maximum length of name for a named subpattern is 32 characters, and the maximum number of named subpatterns is 10000. </P> @@ -71,7 +65,7 @@ Cambridge CB2 3QH, England. REVISION </b><br> <P> -Last updated: 30 November 2011 +Last updated: 24 August 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html index ad17c98..3d1acf6 100644 --- a/doc/html/pcrematching.html +++ b/doc/html/pcrematching.html @@ -164,9 +164,9 @@ always 1, and the value of the <i>capture_last</i> field is always -1. </P> <P> 7. 
The \C escape sequence, which (in the standard algorithm) matches a single -byte, even in UTF-8 mode, is not supported in UTF-8 mode, because the -alternative algorithm moves through the subject string one character at a time, -for all active paths through the tree. +byte, even in UTF-8 mode, is not supported because the alternative algorithm +moves through the subject string one character at a time, for all active paths +through the tree. </P> <P> 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not @@ -220,7 +220,7 @@ Cambridge CB2 3QH, England. </P> <br><a name="SEC8" href="#TOC1">REVISION</a><br> <P> -Last updated: 19 November 2011 +Last updated: 17 November 2010 <br> Copyright © 1997-2010 University of Cambridge. <br> diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html index aa39d63..349c98c 100644 --- a/doc/html/pcrepattern.html +++ b/doc/html/pcrepattern.html @@ -268,8 +268,7 @@ one of the following escape sequences than the binary character it represents: \t tab (hex 09) \ddd character with octal code ddd, or back reference \xhh character with hex code hh - \x{hhh..} character with hex code hhh.. (non-JavaScript mode) - \uhhhh character with hex code hhhh (JavaScript mode only) + \x{hhh..} character with hex code hhh.. </pre> The precise effect of \cx is as follows: if x is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. @@ -281,12 +280,12 @@ values are valid. A lower case letter is converted to upper case, and then the 0xc0 bits are flipped.) </P> <P> -By default, after \x, from zero to two hexadecimal digits are read (letters -can be in upper or lower case). Any number of hexadecimal digits may appear -between \x{ and }, but the value of the character code must be less than 256 -in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum -value in hexadecimal is 7FFFFFFF. 
Note that this is bigger than the largest -Unicode code point, which is 10FFFF. +After \x, from zero to two hexadecimal digits are read (letters can be in +upper or lower case). Any number of hexadecimal digits may appear between \x{ +and }, but the value of the character code must be less than 256 in non-UTF-8 +mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in +hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code +point, which is 10FFFF. </P> <P> If characters other than hexadecimal digits appear between \x{ and }, or if @@ -295,17 +294,9 @@ initial \x will be interpreted as a basic hexadecimal escape, with no following digits, giving a character whose value is zero. </P> <P> -If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is -as just described only when it is followed by two hexadecimal digits. -Otherwise, it matches a literal "x" character. In JavaScript mode, support for -code points greater than 256 is provided by \u, which must be followed by -four hexadecimal digits; otherwise it matches a literal "u" character. -</P> -<P> Characters whose value is less than 256 can be defined by either of the two -syntaxes for \x (or by \u in JavaScript mode). There is no difference in the -way they are handled. For example, \xdc is exactly the same as \x{dc} (or -\u00dc in JavaScript mode). +syntaxes for \x. There is no difference in the way they are handled. For +example, \xdc is exactly the same as \x{dc}. </P> <P> After \0 up to two further octal digits are read. If there are fewer than two @@ -347,25 +338,12 @@ zero, because no more than three octal digits are ever read. </P> <P> All the sequences that define a single character value can be used both inside -and outside character classes. In addition, inside a character class, \b is -interpreted as the backspace character (hex 08). -</P> -<P> -\N is not allowed in a character class. \B, \R, and \X are not special -inside a character class. 
Like other unrecognized escape sequences, they are -treated as the literal characters "B", "R", and "X" by default, but cause an -error if the PCRE_EXTRA option is set. Outside a character class, these -sequences have different meanings. -</P> -<br><b> -Unsupported escape sequences -</b><br> -<P> -In Perl, the sequences \l, \L, \u, and \U are recognized by its string -handler and used to modify the case of following characters. By default, PCRE -does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT -option is set, \U matches a "U" character, and \u can be used to define a -character by code point, as described in the previous section. +and outside character classes. In addition, inside a character class, the +sequence \b is interpreted as the backspace character (hex 08). The sequences +\B, \N, \R, and \X are not special inside a character class. Like any other +unrecognized escape sequences, they are treated as the literal characters "B", +"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is +set. Outside a character class, these sequences have different meanings. </P> <br><b> Absolute and relative back references @@ -411,8 +389,7 @@ Another use of backslash is for specifying generic character types: There is also the single sequence \N, which matches a non-newline character. This is the same as <a href="#fullstopdot">the "." metacharacter</a> -when PCRE_DOTALL is not set. Perl also uses \N to match characters by name; -PCRE does not support this. +when PCRE_DOTALL is not set. </P> <P> Each pair of lower and upper case escape sequences partitions the complete set @@ -986,8 +963,7 @@ special meaning in a character class. <P> The escape sequence \N behaves like a dot, except that it is not affected by the PCRE_DOTALL option. In other words, it matches any character except one -that signifies the end of a line. Perl also uses \N to match characters by -name; PCRE does not support this. 
+that signifies the end of a line. </P> <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> <P> @@ -1003,8 +979,8 @@ processing unless the PCRE_NO_UTF8_CHECK option is used). </P> <P> PCRE does not allow \C to appear in lookbehind assertions -<a href="#lookbehind">(described below)</a> -in UTF-8 mode, because this would make it impossible to calculate the length of +<a href="#lookbehind">(described below),</a> +because in UTF-8 mode this would make it impossible to calculate the length of the lookbehind. </P> <P> @@ -1950,10 +1926,10 @@ match. If there are insufficient characters before the current position, the assertion fails. </P> <P> -In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, -even in UTF-8 mode) to appear in lookbehind assertions, because it makes it -impossible to calculate the length of the lookbehind. The \X and \R escapes, -which can match different numbers of bytes, are also not permitted. +PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode) +to appear in lookbehind assertions, because it makes it impossible to calculate +the length of the lookbehind. The \X and \R escapes, which can match +different numbers of bytes, are also not permitted. </P> <P> <a href="#subpatternsassubroutines">"Subroutine"</a> @@ -2535,11 +2511,10 @@ failing negative assertion, they cause an error if encountered by If any of these verbs are used in an assertion or in a subpattern that is called as a subroutine (whether or not recursively), their effect is confined to that subpattern; it does not extend to the surrounding pattern, with one -exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in -a successful positive assertion <i>is</i> passed back when a match succeeds -(compare capturing parentheses in assertions). Note that such subpatterns are -processed as anchored at the point where they are tested. Note also that Perl's -treatment of subroutines is different in some cases. 
+exception: a *MARK that is encountered in a positive assertion <i>is</i> passed +back (compare capturing parentheses in assertions). Note that such subpatterns +are processed as anchored at the point where they are tested. Note also that +Perl's treatment of subroutines is different in some cases. </P> <P> The new verbs make use of what was previously invalid syntax: an opening @@ -2561,10 +2536,6 @@ the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the pattern with (*NO_START_OPT). </P> -<P> -Experiments with Perl suggest that it too has similar optimizations, sometimes -leading to anomalous results. -</P> <br><b> Verbs that act immediately </b><br> @@ -2612,17 +2583,17 @@ A name is always required with this verb. There may be as many instances of (*MARK) as you like in a pattern, and their names do not have to be unique. </P> <P> -When a match succeeds, the name of the last-encountered (*MARK) on the matching -path is passed back to the caller via the <i>pcre_extra</i> data structure, as -described in the +When a match succeeds, the name of the last-encountered (*MARK) is passed back +to the caller via the <i>pcre_extra</i> data structure, as described in the <a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> in the <a href="pcreapi.html"><b>pcreapi</b></a> -documentation. Here is an example of <b>pcretest</b> output, where the /K -modifier requests the retrieval and outputting of (*MARK) data: +documentation. No data is returned for a partial match. Here is an example of +<b>pcretest</b> output, where the /K modifier requests the retrieval and +outputting of (*MARK) data: <pre> - re> /X(*MARK:A)Y|X(*MARK:B)Z/K - data> XY + /X(*MARK:A)Y|X(*MARK:B)Z/K + XY 0: XY MK: A XZ @@ -2640,17 +2611,32 @@ passed back if it is the last-encountered. This does not happen for negative assertions. 
</P> <P> -After a partial match or a failed match, the name of the last encountered -(*MARK) in the entire match process is returned. For example: +A name may also be returned after a failed match if the final path through the +pattern involves (*MARK). However, unless (*MARK) used in conjunction with +(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the +starting point for matching is advanced, the final check is often with an empty +string, causing a failure before (*MARK) is reached. For example: <pre> - re> /X(*MARK:A)Y|X(*MARK:B)Z/K - data> XP + /X(*MARK:A)Y|X(*MARK:B)Z/K + XP + No match +</pre> +There are three potential starting points for this match (starting with X, +starting with P, and with an empty string). If the pattern is anchored, the +result is different: +<pre> + /^X(*MARK:A)Y|^X(*MARK:B)Z/K + XP No match, mark = B </pre> -Note that in this unanchored example the mark is retained from the match -attempt that started at the letter "X". Subsequent match attempts starting at -"P" and then with an empty string do not get as far as the (*MARK) item, but -nevertheless do not reset it. +PCRE's start-of-match optimizations can also interfere with this. For example, +if, as a result of a call to <b>pcre_study()</b>, it knows the minimum +subject length for a match, a shorter subject will not be scanned at all. +</P> +<P> +Note that similar anomalies (though different in detail) exist in Perl, no +doubt for the same reasons. The use of (*MARK) data after a failed match of an +unanchored pattern is not recommended, unless (*COMMIT) is involved. 
</P> <br><b> Verbs that act after backtracking @@ -2689,8 +2675,8 @@ Note that (*COMMIT) at the start of a pattern is not the same as an anchor, unless PCRE's start-of-match optimizations are turned off, as shown in this <b>pcretest</b> example: <pre> - re> /(*COMMIT)abc/ - data> xyzabc + /(*COMMIT)abc/ + xyzabc 0: abc xyzabc\Y No match @@ -2711,8 +2697,10 @@ reached, or when matching to the right of (*PRUNE), but if there is no match to the right, backtracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alternative to an atomic group or possessive quantifier, but there are some uses of (*PRUNE) that cannot be expressed in any other way. -The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an -anchored pattern (*PRUNE) has the same effect as (*COMMIT). +The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the +match fails completely; the name is passed back if this is the final attempt. +(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored +pattern (*PRUNE) has the same effect as (*COMMIT). <pre> (*SKIP) </pre> @@ -2738,7 +2726,8 @@ following pattern fails to match, the previous path through the pattern is searched for the most recent (*MARK) that has the same name. If one is found, the "bumpalong" advance is to the subject position that corresponds to that (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a -matching name is found, the (*SKIP) is ignored. +matching name is found, normal "bumpalong" of one character happens (that is, +the (*SKIP) is ignored). <pre> (*THEN) or (*THEN:NAME) </pre> @@ -2752,8 +2741,9 @@ be used for a pattern-based if-then-else block: If the COND1 pattern matches, FOO is tried (and possibly further items after the end of the group if FOO succeeds); on failure, the matcher skips to the second alternative and tries COND2, without backtracking into COND1. The -behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). 
-If (*THEN) is not inside an alternation, it acts like (*PRUNE). +behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the +overall match fails. If (*THEN) is not inside an alternation, it acts like +(*PRUNE). </P> <P> Note that a subpattern that does not contain a | character is just a part of @@ -2829,7 +2819,7 @@ Cambridge CB2 3QH, England. </P> <br><a name="SEC28" href="#TOC1">REVISION</a><br> <P> -Last updated: 29 November 2011 +Last updated: 19 October 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html index c883064..40b970d 100644 --- a/doc/html/pcretest.html +++ b/doc/html/pcretest.html @@ -364,10 +364,7 @@ which it appears. </P> <P> The <b>/M</b> modifier causes the size of memory block used to hold the compiled -pattern to be output. This does not include the size of the <b>pcre</b> block; -it is just the actual compiled data. If the pattern is successfully studied -with the PCRE_STUDY_JIT_COMPILE option, the size of the JIT compiled code is -also output. +pattern to be output. </P> <P> If the <b>/S</b> modifier appears once, it causes <b>pcre_study()</b> to be @@ -859,7 +856,7 @@ Cambridge CB2 3QH, England. </P> <br><a name="SEC15" href="#TOC1">REVISION</a><br> <P> -Last updated: 02 December 2011 +Last updated: 26 August 2011 <br> Copyright © 1997-2011 University of Cambridge. <br> |