author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2007-08-21 15:00:15 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2007-08-21 15:00:15 +0000 |
commit | 273487b8386264c012ab681035d19c93b4309ed3 (patch) | |
tree | 8706cad6dc3ba3dcfa9971357c00a725e8029244 /doc/html/pcrepattern.html | |
parent | c6a88bf880d462c62e00d8d7c3eeeaad60ebab49 (diff) | |
Add (*CR) etc.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@227 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/html/pcrepattern.html')
-rw-r--r-- | doc/html/pcrepattern.html | 137 |
1 files changed, 85 insertions, 52 deletions
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html index 45d8181..d9847d5 100644 --- a/doc/html/pcrepattern.html +++ b/doc/html/pcrepattern.html @@ -14,31 +14,32 @@ man page, in case the conversion went wrong. <br> <ul> <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a> -<li><a name="TOC2" href="#SEC2">CHARACTERS AND METACHARACTERS</a> -<li><a name="TOC3" href="#SEC3">BACKSLASH</a> -<li><a name="TOC4" href="#SEC4">CIRCUMFLEX AND DOLLAR</a> -<li><a name="TOC5" href="#SEC5">FULL STOP (PERIOD, DOT)</a> -<li><a name="TOC6" href="#SEC6">MATCHING A SINGLE BYTE</a> -<li><a name="TOC7" href="#SEC7">SQUARE BRACKETS AND CHARACTER CLASSES</a> -<li><a name="TOC8" href="#SEC8">POSIX CHARACTER CLASSES</a> -<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> -<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> -<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> -<li><a name="TOC12" href="#SEC12">DUPLICATE SUBPATTERN NUMBERS</a> -<li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a> -<li><a name="TOC14" href="#SEC14">REPETITION</a> -<li><a name="TOC15" href="#SEC15">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> -<li><a name="TOC16" href="#SEC16">BACK REFERENCES</a> -<li><a name="TOC17" href="#SEC17">ASSERTIONS</a> -<li><a name="TOC18" href="#SEC18">CONDITIONAL SUBPATTERNS</a> -<li><a name="TOC19" href="#SEC19">COMMENTS</a> -<li><a name="TOC20" href="#SEC20">RECURSIVE PATTERNS</a> -<li><a name="TOC21" href="#SEC21">SUBPATTERNS AS SUBROUTINES</a> -<li><a name="TOC22" href="#SEC22">CALLOUTS</a> -<li><a name="TOC23" href="#SEC23">BACTRACKING CONTROL</a> -<li><a name="TOC24" href="#SEC24">SEE ALSO</a> -<li><a name="TOC25" href="#SEC25">AUTHOR</a> -<li><a name="TOC26" href="#SEC26">REVISION</a> +<li><a name="TOC2" href="#SEC2">NEWLINE CONVENTIONS</a> +<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a> +<li><a name="TOC4" href="#SEC4">BACKSLASH</a> +<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> +<li><a name="TOC6" 
href="#SEC6">FULL STOP (PERIOD, DOT)</a> +<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> +<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> +<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> +<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> +<li><a name="TOC11" href="#SEC11">INTERNAL OPTION SETTING</a> +<li><a name="TOC12" href="#SEC12">SUBPATTERNS</a> +<li><a name="TOC13" href="#SEC13">DUPLICATE SUBPATTERN NUMBERS</a> +<li><a name="TOC14" href="#SEC14">NAMED SUBPATTERNS</a> +<li><a name="TOC15" href="#SEC15">REPETITION</a> +<li><a name="TOC16" href="#SEC16">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> +<li><a name="TOC17" href="#SEC17">BACK REFERENCES</a> +<li><a name="TOC18" href="#SEC18">ASSERTIONS</a> +<li><a name="TOC19" href="#SEC19">CONDITIONAL SUBPATTERNS</a> +<li><a name="TOC20" href="#SEC20">COMMENTS</a> +<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a> +<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a> +<li><a name="TOC23" href="#SEC23">CALLOUTS</a> +<li><a name="TOC24" href="#SEC24">BACTRACKING CONTROL</a> +<li><a name="TOC25" href="#SEC25">SEE ALSO</a> +<li><a name="TOC26" href="#SEC26">AUTHOR</a> +<li><a name="TOC27" href="#SEC27">REVISION</a> </ul> <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> <P> @@ -74,7 +75,39 @@ discussed in the <a href="pcrematching.html"><b>pcrematching</b></a> page. </P> -<br><a name="SEC2" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> +<br><a name="SEC2" href="#TOC1">NEWLINE CONVENTIONS</a><br> +<P> +PCRE supports five different conventions for indicating line breaks in +strings: a single CR (carriage return) character, a single LF (linefeed) +character, the two-character sequence CRLF, any of the three preceding, or any +Unicode newline sequence. 
The +<a href="pcreapi.html"><b>pcreapi</b></a> +page has +<a href="pcreapi.html#newlines">further discussion</a> +about newlines, and shows how to set the newline convention in the +<i>options</i> arguments for the compiling and matching functions. +</P> +<P> +It is also possible to specify a newline convention by starting a pattern +string with one of the following five sequences: +<pre> + (*CR) carriage return + (*LF) linefeed + (*CRLF) carriage return, followed by linefeed + (*ANYCRLF) any of the three above + (*ANY) all Unicode newline sequences +</pre> +These override the default and the options given to <b>pcre_compile()</b>. For +example, on a Unix system where LF is the default newline sequence, the pattern +<pre> + (*CR)a.b +</pre> +changes the convention to CR. That pattern matches "a\nb" because LF is no +longer a newline. Note that these special settings, which are not +Perl-compatible, are recognized only at the very start of a pattern, and that +they must be in upper case. +</P> +<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> <P> A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the @@ -131,7 +164,7 @@ a character class the only metacharacters are: </pre> The following sections describe the use of each of the metacharacters. </P> -<br><a name="SEC3" href="#TOC1">BACKSLASH</a><br> +<br><a name="SEC4" href="#TOC1">BACKSLASH</a><br> <P> The backslash character has several uses. 
Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character @@ -180,7 +213,7 @@ represents: \cx "control-x", where x is any character \e escape (hex 1B) \f formfeed (hex 0C) - \n newline (hex 0A) + \n linefeed (hex 0A) \r carriage return (hex 0D) \t tab (hex 09) \ddd character with octal code ddd, or backreference @@ -675,7 +708,7 @@ If all the alternatives of a pattern begin with \G, the expression is anchored to the starting match position, and the "anchored" flag is set in the compiled regular expression. </P> -<br><a name="SEC4" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br> +<br><a name="SEC5" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br> <P> Outside a character class, in the default matching mode, the circumflex character is an assertion that is true only if the current matching point is @@ -729,7 +762,7 @@ Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A it is always anchored, whether or not PCRE_MULTILINE is set. </P> -<br><a name="SEC5" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br> +<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br> <P> Outside a character class, a dot in the pattern matches any one character in the subject string except (by default) a character that signifies the end of a @@ -754,7 +787,7 @@ The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newlines. Dot has no special meaning in a character class. </P> -<br><a name="SEC6" href="#TOC1">MATCHING A SINGLE BYTE</a><br> +<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> <P> Outside a character class, the escape sequence \C matches any one byte, both in and out of UTF-8 mode. 
Unlike a dot, it always matches any line-ending @@ -769,7 +802,7 @@ PCRE does not allow \C to appear in lookbehind assertions because in UTF-8 mode this would make it impossible to calculate the length of the lookbehind. <a name="characterclass"></a></P> -<br><a name="SEC7" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br> +<br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br> <P> An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a @@ -864,7 +897,7 @@ introducing a POSIX class name - see the next section), and the terminating closing square bracket. However, escaping other non-alphanumeric characters does no harm. </P> -<br><a name="SEC8" href="#TOC1">POSIX CHARACTER CLASSES</a><br> +<br><a name="SEC9" href="#TOC1">POSIX CHARACTER CLASSES</a><br> <P> Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE also supports @@ -910,7 +943,7 @@ supported, and an error is given if they are encountered. In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes. </P> -<br><a name="SEC9" href="#TOC1">VERTICAL BAR</a><br> +<br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br> <P> Vertical bar characters are used to separate alternative patterns. For example, the pattern @@ -925,7 +958,7 @@ that succeeds is used. If the alternatives are within a subpattern "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern. 
</P> -<br><a name="SEC10" href="#TOC1">INTERNAL OPTION SETTING</a><br> +<br><a name="SEC11" href="#TOC1">INTERNAL OPTION SETTING</a><br> <P> The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED options can be changed from within the pattern by a sequence of @@ -973,7 +1006,7 @@ The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be changed in the same way as the Perl-compatible options by using the characters J, U and X respectively. <a name="subpattern"></a></P> -<br><a name="SEC11" href="#TOC1">SUBPATTERNS</a><br> +<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> <P> Subpatterns are delimited by parentheses (round brackets), which can be nested. Turning part of a pattern into a subpattern does two things: @@ -1027,7 +1060,7 @@ from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday". </P> -<br><a name="SEC12" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> +<br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> <P> Perl 5.10 introduced a feature whereby each alternative in a subpattern uses the same numbers for its capturing parentheses. Such a subpattern starts with @@ -1058,7 +1091,7 @@ the first one in the pattern with the given number. An alternative approach to using this "branch reset" feature is to use duplicate named subpatterns, as described in the next section. </P> -<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br> +<br><a name="SEC14" href="#TOC1">NAMED SUBPATTERNS</a><br> <P> Identifying capturing parentheses by number is simple, but it can be very hard to keep track of the numbers in complicated regular expressions. Furthermore, @@ -1113,7 +1146,7 @@ details of the interfaces for handling named subpatterns, see the <a href="pcreapi.html"><b>pcreapi</b></a> documentation. 
</P> -<br><a name="SEC14" href="#TOC1">REPETITION</a><br> +<br><a name="SEC15" href="#TOC1">REPETITION</a><br> <P> Repetition is specified by quantifiers, which can follow any of the following items: @@ -1264,7 +1297,7 @@ example, after </pre> matches "aba" the value of the second captured substring is "b". <a name="atomicgroup"></a></P> -<br><a name="SEC15" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> +<br><a name="SEC16" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> <P> With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be @@ -1368,7 +1401,7 @@ an atomic group, like this: </pre> sequences of non-digits cannot be broken, and failure happens quickly. <a name="backreferences"></a></P> -<br><a name="SEC16" href="#TOC1">BACK REFERENCES</a><br> +<br><a name="SEC17" href="#TOC1">BACK REFERENCES</a><br> <P> Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier @@ -1482,7 +1515,7 @@ that the first iteration does not need to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero. <a name="bigassertions"></a></P> -<br><a name="SEC17" href="#TOC1">ASSERTIONS</a><br> +<br><a name="SEC18" href="#TOC1">ASSERTIONS</a><br> <P> An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple @@ -1642,7 +1675,7 @@ preceded by "foo", while is another pattern that matches "foo" preceded by three digits and any three characters that are not "999". 
<a name="conditions"></a></P> -<br><a name="SEC18" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> +<br><a name="SEC19" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> <P> It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on @@ -1780,7 +1813,7 @@ subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. <a name="comments"></a></P> -<br><a name="SEC19" href="#TOC1">COMMENTS</a><br> +<br><a name="SEC20" href="#TOC1">COMMENTS</a><br> <P> The sequence (?# marks the start of a comment that continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters @@ -1791,7 +1824,7 @@ If the PCRE_EXTENDED option is set, an unescaped # character outside a character class introduces a comment that continues to immediately after the next newline in the pattern. <a name="recursion"></a></P> -<br><a name="SEC20" href="#TOC1">RECURSIVE PATTERNS</a><br> +<br><a name="SEC21" href="#TOC1">RECURSIVE PATTERNS</a><br> <P> Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can @@ -1921,7 +1954,7 @@ In this pattern, (?(R) is the start of a conditional subpattern, with two different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call. <a name="subpatternsassubroutines"></a></P> -<br><a name="SEC21" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> +<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> <P> If the syntax for a recursive subpattern reference (either by number or by name) is used outside the parentheses to which it refers, it operates like a @@ -1961,7 +1994,7 @@ changed for different calls. For example, consider this pattern: It matches "abcabc". 
It does not match "abcABC" because the change of processing option does not affect the called subpattern. </P> -<br><a name="SEC22" href="#TOC1">CALLOUTS</a><br> +<br><a name="SEC23" href="#TOC1">CALLOUTS</a><br> <P> Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it @@ -1996,7 +2029,7 @@ description of the interface to the callout function is given in the <a href="pcrecallout.html"><b>pcrecallout</b></a> documentation. </P> -<br><a name="SEC23" href="#TOC1">BACTRACKING CONTROL</a><br> +<br><a name="SEC24" href="#TOC1">BACTRACKING CONTROL</a><br> <P> Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which are described in the Perl documentation as "experimental and subject to change @@ -2111,11 +2144,11 @@ the end of the group if FOO succeeds); on failure the matcher skips to the second alternative and tries COND2, without backtracking into COND1. If (*THEN) is used outside of any alternation, it acts exactly like (*PRUNE). </P> -<br><a name="SEC24" href="#TOC1">SEE ALSO</a><br> +<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br> <P> <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). </P> -<br><a name="SEC25" href="#TOC1">AUTHOR</a><br> +<br><a name="SEC26" href="#TOC1">AUTHOR</a><br> <P> Philip Hazel <br> @@ -2124,9 +2157,9 @@ University Computing Service Cambridge CB2 3QH, England. <br> </P> -<br><a name="SEC26" href="#TOC1">REVISION</a><br> +<br><a name="SEC27" href="#TOC1">REVISION</a><br> <P> -Last updated: 09 August 2007 +Last updated: 21 August 2007 <br> Copyright © 1997-2007 University of Cambridge. <br> |
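
Note on the added NEWLINE CONVENTIONS text: the rule it documents (one of five upper-case sequences, recognized only at the very start of the pattern, overriding the default) can be sketched as below. This is an illustration of the documented behaviour only, not PCRE's implementation. The function name `split_newline_prefix`, the `default` parameter, and the returned convention labels are assumptions made for this sketch.

```python
# Illustrative sketch of the documented rule; not taken from the PCRE source.
# The five sequences and their meanings come from the added documentation.
NEWLINE_PREFIXES = {
    "(*CR)": "CR",            # carriage return
    "(*LF)": "LF",            # linefeed
    "(*CRLF)": "CRLF",        # carriage return, followed by linefeed
    "(*ANYCRLF)": "ANYCRLF",  # any of the three above
    "(*ANY)": "ANY",          # all Unicode newline sequences
}


def split_newline_prefix(pattern, default="LF"):
    """Return (convention, remaining_pattern) for a pattern string.

    `default` stands in for the build-time default or the option given
    at compile time (hypothetical parameter). A leading sequence
    overrides it. The comparison is case-sensitive and only looks at the
    very start of the pattern, as the documentation requires.
    """
    for prefix, convention in NEWLINE_PREFIXES.items():
        if pattern.startswith(prefix):
            return convention, pattern[len(prefix):]
    return default, pattern


if __name__ == "__main__":
    print(split_newline_prefix("(*CR)a.b"))
    print(split_newline_prefix("(*cr)a.b"))
    print(split_newline_prefix("a(*CR)b"))
```

With the documentation's own example, `(*CR)a.b` yields the CR convention and the remaining pattern `a.b`. A lower-case `(*cr)` or a sequence that is not at the start is left in the pattern and the default convention is kept.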