author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-08-21 15:00:15 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-08-21 15:00:15 +0000
commit 273487b8386264c012ab681035d19c93b4309ed3
tree 8706cad6dc3ba3dcfa9971357c00a725e8029244 /doc/html/pcrepattern.html
parent c6a88bf880d462c62e00d8d7c3eeeaad60ebab49
Add (*CR) etc.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@227 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/html/pcrepattern.html')
-rw-r--r-- doc/html/pcrepattern.html 137
1 files changed, 85 insertions, 52 deletions
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 45d8181..d9847d5 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -14,31 +14,32 @@ man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
-<li><a name="TOC2" href="#SEC2">CHARACTERS AND METACHARACTERS</a>
-<li><a name="TOC3" href="#SEC3">BACKSLASH</a>
-<li><a name="TOC4" href="#SEC4">CIRCUMFLEX AND DOLLAR</a>
-<li><a name="TOC5" href="#SEC5">FULL STOP (PERIOD, DOT)</a>
-<li><a name="TOC6" href="#SEC6">MATCHING A SINGLE BYTE</a>
-<li><a name="TOC7" href="#SEC7">SQUARE BRACKETS AND CHARACTER CLASSES</a>
-<li><a name="TOC8" href="#SEC8">POSIX CHARACTER CLASSES</a>
-<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a>
-<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a>
-<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a>
-<li><a name="TOC12" href="#SEC12">DUPLICATE SUBPATTERN NUMBERS</a>
-<li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a>
-<li><a name="TOC14" href="#SEC14">REPETITION</a>
-<li><a name="TOC15" href="#SEC15">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
-<li><a name="TOC16" href="#SEC16">BACK REFERENCES</a>
-<li><a name="TOC17" href="#SEC17">ASSERTIONS</a>
-<li><a name="TOC18" href="#SEC18">CONDITIONAL SUBPATTERNS</a>
-<li><a name="TOC19" href="#SEC19">COMMENTS</a>
-<li><a name="TOC20" href="#SEC20">RECURSIVE PATTERNS</a>
-<li><a name="TOC21" href="#SEC21">SUBPATTERNS AS SUBROUTINES</a>
-<li><a name="TOC22" href="#SEC22">CALLOUTS</a>
-<li><a name="TOC23" href="#SEC23">BACTRACKING CONTROL</a>
-<li><a name="TOC24" href="#SEC24">SEE ALSO</a>
-<li><a name="TOC25" href="#SEC25">AUTHOR</a>
-<li><a name="TOC26" href="#SEC26">REVISION</a>
+<li><a name="TOC2" href="#SEC2">NEWLINE CONVENTIONS</a>
+<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a>
+<li><a name="TOC4" href="#SEC4">BACKSLASH</a>
+<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a>
+<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT)</a>
+<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a>
+<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a>
+<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a>
+<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a>
+<li><a name="TOC11" href="#SEC11">INTERNAL OPTION SETTING</a>
+<li><a name="TOC12" href="#SEC12">SUBPATTERNS</a>
+<li><a name="TOC13" href="#SEC13">DUPLICATE SUBPATTERN NUMBERS</a>
+<li><a name="TOC14" href="#SEC14">NAMED SUBPATTERNS</a>
+<li><a name="TOC15" href="#SEC15">REPETITION</a>
+<li><a name="TOC16" href="#SEC16">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
+<li><a name="TOC17" href="#SEC17">BACK REFERENCES</a>
+<li><a name="TOC18" href="#SEC18">ASSERTIONS</a>
+<li><a name="TOC19" href="#SEC19">CONDITIONAL SUBPATTERNS</a>
+<li><a name="TOC20" href="#SEC20">COMMENTS</a>
+<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a>
+<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a>
+<li><a name="TOC23" href="#SEC23">CALLOUTS</a>
+<li><a name="TOC24" href="#SEC24">BACTRACKING CONTROL</a>
+<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
+<li><a name="TOC26" href="#SEC26">AUTHOR</a>
+<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
<P>
@@ -74,7 +75,39 @@ discussed in the
<a href="pcrematching.html"><b>pcrematching</b></a>
page.
</P>
-<br><a name="SEC2" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
+<br><a name="SEC2" href="#TOC1">NEWLINE CONVENTIONS</a><br>
+<P>
+PCRE supports five different conventions for indicating line breaks in
+strings: a single CR (carriage return) character, a single LF (linefeed)
+character, the two-character sequence CRLF, any of the three preceding, or any
+Unicode newline sequence. The
+<a href="pcreapi.html"><b>pcreapi</b></a>
+page has
+<a href="pcreapi.html#newlines">further discussion</a>
+about newlines, and shows how to set the newline convention in the
+<i>options</i> arguments for the compiling and matching functions.
+</P>
+<P>
+It is also possible to specify a newline convention by starting a pattern
+string with one of the following five sequences:
+<pre>
+ (*CR) carriage return
+ (*LF) linefeed
+ (*CRLF) carriage return, followed by linefeed
+ (*ANYCRLF) any of the three above
+ (*ANY) all Unicode newline sequences
+</pre>
+These override the default and the options given to <b>pcre_compile()</b>. For
+example, on a Unix system where LF is the default newline sequence, the pattern
+<pre>
+ (*CR)a.b
+</pre>
+changes the convention to CR. That pattern matches "a\nb" because LF is no
+longer a newline. Note that these special settings, which are not
+Perl-compatible, are recognized only at the very start of a pattern, and that
+they must be in upper case.
+</P>
+<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
<P>
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
@@ -131,7 +164,7 @@ a character class the only metacharacters are:
</pre>
The following sections describe the use of each of the metacharacters.
</P>
-<br><a name="SEC3" href="#TOC1">BACKSLASH</a><br>
+<br><a name="SEC4" href="#TOC1">BACKSLASH</a><br>
<P>
The backslash character has several uses. Firstly, if it is followed by a
non-alphanumeric character, it takes away any special meaning that character
@@ -180,7 +213,7 @@ represents:
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
- \n newline (hex 0A)
+ \n linefeed (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\ddd character with octal code ddd, or backreference
@@ -675,7 +708,7 @@ If all the alternatives of a pattern begin with \G, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
</P>
-<br><a name="SEC4" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
+<br><a name="SEC5" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
<P>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is
@@ -729,7 +762,7 @@ Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not PCRE_MULTILINE is set.
</P>
-<br><a name="SEC5" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
+<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
<P>
Outside a character class, a dot in the pattern matches any one character in
the subject string except (by default) a character that signifies the end of a
@@ -754,7 +787,7 @@ The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</P>
-<br><a name="SEC6" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
+<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
<P>
Outside a character class, the escape sequence \C matches any one byte, both
in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
@@ -769,7 +802,7 @@ PCRE does not allow \C to appear in lookbehind assertions
because in UTF-8 mode this would make it impossible to calculate the length of
the lookbehind.
<a name="characterclass"></a></P>
-<br><a name="SEC7" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
+<br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
<P>
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special. If a
@@ -864,7 +897,7 @@ introducing a POSIX class name - see the next section), and the terminating
closing square bracket. However, escaping other non-alphanumeric characters
does no harm.
</P>
-<br><a name="SEC8" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
+<br><a name="SEC9" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
<P>
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also supports
@@ -910,7 +943,7 @@ supported, and an error is given if they are encountered.
In UTF-8 mode, characters with values greater than 128 do not match any of
the POSIX character classes.
</P>
-<br><a name="SEC9" href="#TOC1">VERTICAL BAR</a><br>
+<br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br>
<P>
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
@@ -925,7 +958,7 @@ that succeeds is used. If the alternatives are within a subpattern
"succeeds" means matching the rest of the main pattern as well as the
alternative in the subpattern.
</P>
-<br><a name="SEC10" href="#TOC1">INTERNAL OPTION SETTING</a><br>
+<br><a name="SEC11" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a sequence of
@@ -973,7 +1006,7 @@ The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
changed in the same way as the Perl-compatible options by using the characters
J, U and X respectively.
<a name="subpattern"></a></P>
-<br><a name="SEC11" href="#TOC1">SUBPATTERNS</a><br>
+<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>
<P>
Subpatterns are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a subpattern does two things:
@@ -1027,7 +1060,7 @@ from left to right, and options are not reset until the end of the subpattern
is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
</P>
-<br><a name="SEC12" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
+<br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
<P>
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
the same numbers for its capturing parentheses. Such a subpattern starts with
@@ -1058,7 +1091,7 @@ the first one in the pattern with the given number.
An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
</P>
-<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>
+<br><a name="SEC14" href="#TOC1">NAMED SUBPATTERNS</a><br>
<P>
Identifying capturing parentheses by number is simple, but it can be very hard
to keep track of the numbers in complicated regular expressions. Furthermore,
@@ -1113,7 +1146,7 @@ details of the interfaces for handling named subpatterns, see the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
</P>
-<br><a name="SEC14" href="#TOC1">REPETITION</a><br>
+<br><a name="SEC15" href="#TOC1">REPETITION</a><br>
<P>
Repetition is specified by quantifiers, which can follow any of the following
items:
@@ -1264,7 +1297,7 @@ example, after
</pre>
matches "aba" the value of the second captured substring is "b".
<a name="atomicgroup"></a></P>
-<br><a name="SEC15" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
+<br><a name="SEC16" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
<P>
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item to be
@@ -1368,7 +1401,7 @@ an atomic group, like this:
</pre>
sequences of non-digits cannot be broken, and failure happens quickly.
<a name="backreferences"></a></P>
-<br><a name="SEC16" href="#TOC1">BACK REFERENCES</a><br>
+<br><a name="SEC17" href="#TOC1">BACK REFERENCES</a><br>
<P>
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
@@ -1482,7 +1515,7 @@ that the first iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a quantifier with a
minimum of zero.
<a name="bigassertions"></a></P>
-<br><a name="SEC17" href="#TOC1">ASSERTIONS</a><br>
+<br><a name="SEC18" href="#TOC1">ASSERTIONS</a><br>
<P>
An assertion is a test on the characters following or preceding the current
matching point that does not actually consume any characters. The simple
@@ -1642,7 +1675,7 @@ preceded by "foo", while
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
<a name="conditions"></a></P>
-<br><a name="SEC18" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
+<br><a name="SEC19" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
<P>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
@@ -1780,7 +1813,7 @@ subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
<a name="comments"></a></P>
-<br><a name="SEC19" href="#TOC1">COMMENTS</a><br>
+<br><a name="SEC20" href="#TOC1">COMMENTS</a><br>
<P>
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. The characters
@@ -1791,7 +1824,7 @@ If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues to immediately after the
next newline in the pattern.
<a name="recursion"></a></P>
-<br><a name="SEC20" href="#TOC1">RECURSIVE PATTERNS</a><br>
+<br><a name="SEC21" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
@@ -1921,7 +1954,7 @@ In this pattern, (?(R) is the start of a conditional subpattern, with two
different alternatives for the recursive and non-recursive cases. The (?R) item
is the actual recursive call.
<a name="subpatternsassubroutines"></a></P>
-<br><a name="SEC21" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
+<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive subpattern reference (either by number or by
name) is used outside the parentheses to which it refers, it operates like a
@@ -1961,7 +1994,7 @@ changed for different calls. For example, consider this pattern:
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
</P>
-<br><a name="SEC22" href="#TOC1">CALLOUTS</a><br>
+<br><a name="SEC23" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
@@ -1996,7 +2029,7 @@ description of the interface to the callout function is given in the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
</P>
-<br><a name="SEC23" href="#TOC1">BACTRACKING CONTROL</a><br>
+<br><a name="SEC24" href="#TOC1">BACTRACKING CONTROL</a><br>
<P>
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
are described in the Perl documentation as "experimental and subject to change
@@ -2111,11 +2144,11 @@ the end of the group if FOO succeeds); on failure the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. If (*THEN)
is used outside of any alternation, it acts exactly like (*PRUNE).
</P>
-<br><a name="SEC24" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
-<br><a name="SEC25" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -2124,9 +2157,9 @@ University Computing Service
Cambridge CB2 3QH, England.
<br>
</P>
-<br><a name="SEC26" href="#TOC1">REVISION</a><br>
+<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 09 August 2007
+Last updated: 21 August 2007
<br>
Copyright &copy; 1997-2007 University of Cambridge.
<br>
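
The NEWLINE CONVENTIONS text added by this patch says that the five sequences (*CR), (*LF), (*CRLF), (*ANYCRLF) and (*ANY) are recognized only at the very start of a pattern and must be in upper case. A minimal sketch of that recognition rule follows. It is not PCRE's own code: the helper name split_newline_verb and its return shape are assumptions made for illustration, and only the five sequences and their descriptions are taken from the patch.

```python
# Hypothetical helper, not part of PCRE: split a leading newline-convention
# sequence off a pattern string. The table mirrors the list in the patch.
NEWLINE_VERBS = {
    "(*CR)": "carriage return",
    "(*LF)": "linefeed",
    "(*CRLF)": "carriage return, followed by linefeed",
    "(*ANYCRLF)": "any of the three above",
    "(*ANY)": "all Unicode newline sequences",
}


def split_newline_verb(pattern):
    """Return (verb, rest) if the pattern starts with one of the five
    sequences, otherwise (None, pattern).

    The comparison is case-sensitive and anchored at position 0, which
    reflects the "very start" and "upper case" rules in the text. The
    closing parenthesis is part of each key, so "(*CR)" cannot be
    mistaken for a prefix of "(*CRLF)".
    """
    for verb in NEWLINE_VERBS:
        if pattern.startswith(verb):
            return verb, pattern[len(verb):]
    return None, pattern


if __name__ == "__main__":
    verb, rest = split_newline_verb("(*CR)a.b")
    print(verb, "=>", NEWLINE_VERBS[verb], "| remaining pattern:", rest)
```

The sketch covers only the recognition step. What PCRE does with a lower-case or misplaced sequence is not modelled; the helper simply reports that no convention was found.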