Diffstat (limited to 'ext/pcre/pcrelib/doc/Tech.Notes')
-rw-r--r-- | ext/pcre/pcrelib/doc/Tech.Notes | 35 |
1 files changed, 28 insertions, 7 deletions
diff --git a/ext/pcre/pcrelib/doc/Tech.Notes b/ext/pcre/pcrelib/doc/Tech.Notes
index dd01932f8d..73c31c7ca1 100644
--- a/ext/pcre/pcrelib/doc/Tech.Notes
+++ b/ext/pcre/pcrelib/doc/Tech.Notes
@@ -48,7 +48,9 @@ These items are all just one byte long
 
   OP_END                 end of pattern
   OP_ANY                 match any character
+  OP_ANYBYTE             match any single byte, even in UTF-8 mode
   OP_SOD                 match start of data: \A
+  OP_SOM,                start of match (subject + offset): \G
   OP_CIRC                ^ (start of data, or after \n in multiline)
   OP_NOT_WORD_BOUNDARY   \W
   OP_WORD_BOUNDARY       \w
@@ -61,7 +63,6 @@ These items are all just one byte long
   OP_EODN                match end of data or \n at end: \Z
   OP_EOD                 match end of data: \z
   OP_DOLL                $ (end of data, or before \n in multiline)
-  OP_RECURSE             match the pattern recursively
 
 
 Repeating single characters
@@ -119,8 +120,7 @@ instances of OP_CHARS are used.
 Character classes
 -----------------
 
-When characters less than 256 are involved, OP_CLASS is used for a character
-class. If there is only one character, OP_CHARS is used for a positive class,
+If there is only one character, OP_CHARS is used for a positive class,
 and OP_NOT for a negative one (that is, for something like [^a]). However, in
 UTF-8 mode, this applies only to characters with values < 128, because OP_NOT
 is confined to single bytes.
@@ -129,9 +129,15 @@ Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
 negated, single-character class. The normal ones (OP_STAR etc.) are used for a
 repeated positive single-character class.
 
-OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every
-character that is acceptable. The bits are counted from the least significant
-end of each byte.
+When there's more than one character in a class and all the characters are less
+than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
+one. In either case, the opcode is followed by a 32-byte bit map containing a 1
+bit for every character that is acceptable. The bits are counted from the least
+significant end of each byte.
+
+The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
+subject characters with values greater than 256 can be handled correctly. For
+OP_CLASS they don't match, whereas for OP_NCLASS they do.
 
 For classes containing characters with values > 255, OP_XCLASS is used. It
 optionally uses a bit map (if any characters lie within it), followed by a list
@@ -243,6 +249,21 @@ same scheme is used, with a "reference number" of 0xffff. Otherwise, a
 conditional subpattern always starts with one of the assertions.
 
 
+Recursion
+---------
+
+Recursion either matches the current regex, or some subexpression. The opcode
+OP_RECURSE is followed by an value which is the offset to the starting bracket
+from the start of the whole pattern.
+
+
+Callout
+-------
+
+OP_CALLOUT is followed by one byte of data that holds a callout number in the
+range 0 to 255.
+
+
 Changing options
 ----------------
 
@@ -257,4 +278,4 @@ at compile time, and so does not cause anything to be put into the compiled
 data.
 
 Philip Hazel
-August 2002
+August 2003
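
Illustration (not part of the commit): the class hunk above describes a
32-byte bit map, with bits counted from the least significant end of each
byte, that follows both OP_CLASS and OP_NCLASS. A minimal C sketch of how such
a map could be filled and tested is given below. The names class_bitmap,
class_bitmap_add, class_bitmap_test and class_matches are hypothetical and are
not taken from the PCRE sources. The sketch assumes that the 32 bytes cover
character values 0 to 255 only, so any larger value falls outside the map and
the result then depends on the opcode alone, as the added text states.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the 32-byte map (32 * 8 = 256 bits). */
typedef struct {
    uint8_t bits[32];
} class_bitmap;

/* Mark character value c (0..255) as acceptable.
   Byte index is c / 8; bit index counts from the least significant end. */
static void class_bitmap_add(class_bitmap *m, unsigned c)
{
    assert(c < 256);
    m->bits[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Return 1 if character value c (0..255) has its bit set, else 0. */
static int class_bitmap_test(const class_bitmap *m, unsigned c)
{
    assert(c < 256);
    return (m->bits[c >> 3] >> (c & 7)) & 1;
}

/* Decide whether subject character c is accepted by the class.
   is_nclass is nonzero for OP_NCLASS behaviour, zero for OP_CLASS.
   The map already holds a 1 bit for every acceptable character in either
   case, so the opcode only matters for values the map cannot represent. */
static int class_matches(const class_bitmap *m, int is_nclass, uint32_t c)
{
    if (c > 255)
        return is_nclass ? 1 : 0;   /* outside the map: NCLASS matches */
    return class_bitmap_test(m, (unsigned)c);
}
```

For example, after memset-ing a class_bitmap to zero and adding 'a' (value 97
in ASCII), byte 12 of the map holds 0x02, because 97 = 12 * 8 + 1.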