path: root/ext/pcre/pcrelib/doc/Tech.Notes
Diffstat (limited to 'ext/pcre/pcrelib/doc/Tech.Notes')
-rw-r--r-- ext/pcre/pcrelib/doc/Tech.Notes 35
1 files changed, 28 insertions, 7 deletions
diff --git a/ext/pcre/pcrelib/doc/Tech.Notes b/ext/pcre/pcrelib/doc/Tech.Notes
index dd01932f8d..73c31c7ca1 100644
--- a/ext/pcre/pcrelib/doc/Tech.Notes
+++ b/ext/pcre/pcrelib/doc/Tech.Notes
@@ -48,7 +48,9 @@ These items are all just one byte long
OP_END end of pattern
OP_ANY match any character
+ OP_ANYBYTE match any single byte, even in UTF-8 mode
OP_SOD match start of data: \A
+ OP_SOM start of match (subject + offset): \G
OP_CIRC ^ (start of data, or after \n in multiline)
OP_NOT_WORD_BOUNDARY \B
OP_WORD_BOUNDARY \b
@@ -61,7 +63,6 @@ These items are all just one byte long
OP_EODN match end of data or \n at end: \Z
OP_EOD match end of data: \z
OP_DOLL $ (end of data, or before \n in multiline)
- OP_RECURSE match the pattern recursively
Repeating single characters
@@ -119,8 +120,7 @@ instances of OP_CHARS are used.
Character classes
-----------------
-When characters less than 256 are involved, OP_CLASS is used for a character
-class. If there is only one character, OP_CHARS is used for a positive class,
+If there is only one character, OP_CHARS is used for a positive class,
and OP_NOT for a negative one (that is, for something like [^a]). However, in
UTF-8 mode, this applies only to characters with values < 128, because OP_NOT
is confined to single bytes.
@@ -129,9 +129,15 @@ Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
negated, single-character class. The normal ones (OP_STAR etc.) are used for a
repeated positive single-character class.
-OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every
-character that is acceptable. The bits are counted from the least significant
-end of each byte.
+When there's more than one character in a class and all the characters are less
+than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
+one. In either case, the opcode is followed by a 32-byte bit map containing a 1
+bit for every character that is acceptable. The bits are counted from the least
+significant end of each byte.
+
+The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
+subject characters with values greater than 255 can be handled correctly. For
+OP_CLASS they don't match, whereas for OP_NCLASS they do.
For classes containing characters with values > 255, OP_XCLASS is used. It
optionally uses a bit map (if any characters lie within it), followed by a list
@@ -243,6 +249,21 @@ same scheme is used, with a "reference number" of 0xffff. Otherwise, a
conditional subpattern always starts with one of the assertions.
+Recursion
+---------
+
+Recursion either matches the current regex, or some subexpression. The opcode
+OP_RECURSE is followed by a value which is the offset to the starting bracket
+from the start of the whole pattern.
+
+
+Callout
+-------
+
+OP_CALLOUT is followed by one byte of data that holds a callout number in the
+range 0 to 255.
+
+
Changing options
----------------
@@ -257,4 +278,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
-August 2002
+August 2003
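
The 32-byte bit map described in the character-class hunk can be illustrated
with a minimal C sketch. It is not taken from the PCRE source: the type
class_item and the functions class_init, class_add and class_matches are
hypothetical names chosen for this example, and the "negated" flag merely
stands in for the choice between OP_CLASS and OP_NCLASS. The sketch assumes
only what the notes state: one bit per character below 256, bits counted from
the least significant end of each byte, a 1 bit meaning "acceptable", and
characters above 255 failing a positive class but matching a negative one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for illustration; not identifiers from the PCRE source. */
enum { CLASS_MAP_BYTES = 32 };      /* 32 bytes * 8 bits = 256 characters */

typedef struct {
    int negated;                    /* 0 models OP_CLASS, 1 models OP_NCLASS */
    uint8_t map[CLASS_MAP_BYTES];   /* a 1 bit marks an acceptable character */
} class_item;

/* Start with no acceptable characters (positive class) or all 256
   acceptable (negative class). */
static void class_init(class_item *item, int negated)
{
    item->negated = negated;
    memset(item->map, negated ? 0xff : 0x00, sizeof item->map);
}

/* Record that character c was listed between the square brackets. */
static void class_add(class_item *item, unsigned int c)
{
    assert(c < 256);                /* larger values do not fit in the map */
    uint8_t bit = (uint8_t)(1u << (c & 7));   /* least significant bit first */
    if (item->negated)
        item->map[c >> 3] &= (uint8_t)~bit;   /* listed char becomes unacceptable */
    else
        item->map[c >> 3] |= bit;             /* listed char becomes acceptable */
}

/* Return 1 if subject character c is accepted by the class, else 0. */
static int class_matches(const class_item *item, unsigned int c)
{
    if (c > 255)
        return item->negated;       /* outside the map: only a negative class matches */
    return (item->map[c >> 3] >> (c & 7)) & 1;
}
```

For example, after class_init(&item, 0) followed by class_add for 'a' and 'b',
class_matches accepts 'a' and rejects both 'c' and the character 0x100; with
class_init(&item, 1) and the same additions, it rejects 'a' and accepts both
'c' and 0x100.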