author | Karl Williamson <public@khwilliamson.com> | 2013-09-12 19:42:51 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2013-09-24 11:36:11 -0600 |
commit | c8849eb1c2c14b1c9a128a8f8a696ae1eac43f63 (patch) | |
tree | 90c3275ec01c423695636c0b5883dde3eeee3567 /pod/perlreguts.pod | |
parent | bd650281b855f8970df792fc7c58cffd5bfe272e (diff) | |
download | perl-c8849eb1c2c14b1c9a128a8f8a696ae1eac43f63.tar.gz |
perlreguts: Bring up-to-date
Various changes have been made to regcomp.c that didn't make it into
perlreguts until now.
Diffstat (limited to 'pod/perlreguts.pod')
-rw-r--r-- | pod/perlreguts.pod | 48 |
1 files changed, 12 insertions, 36 deletions
diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod
index 039b48c82d..d93c799ce6 100644
--- a/pod/perlreguts.pod
+++ b/pod/perlreguts.pod
@@ -168,23 +168,29 @@ multiple of four bytes:
 
 =item C<regnode_charclass>
 
-Bracketed character classes are represented by C<regnode_charclass> structures,
-which have a four-byte argument and then a 32-byte (256-bit) bitmap
-indicating which characters are included in the class.
+Bracketed character classes are represented by C<regnode_charclass>
+structures, which have a four-byte argument and then a 32-byte (256-bit)
+bitmap indicating which characters in the Latin1 range are included in
+the class.
 
     regnode_charclass   U32 arg1;
                         char bitmap[ANYOF_BITMAP_SIZE];
 
+Various flags whose names begin with C<ANYOF_> are used for special
+situations. Above Latin1 matches and things not known until run-time
+are stored in L</Perl's pprivate structure>.
+
 =item C<regnode_charclass_class>
 
 There is also a larger form of a char class structure used to represent
-POSIX char classes called C<regnode_charclass_class> which has an
-additional 4-byte (32-bit) bitmap indicating which POSIX char classes
+POSIX char classes under C</l> matching,
+called C<regnode_charclass_class> which has an
+additional 32-bit bitmap indicating which POSIX char classes
 have been included.
 
     regnode_charclass_class  U32 arg1;
                              char bitmap[ANYOF_BITMAP_SIZE];
-                             char classflags[ANYOF_CLASSBITMAP_SIZE];
+                             U32 classflags;
 
 =back
 
@@ -761,36 +767,6 @@ Care must be taken when making changes to make sure that you handle
 UTF-8 properly, both at compile time and at execution time, including
 when the string and pattern are mismatched.
 
-The following comment in F<regcomp.h> gives an example of exactly how
-tricky this can be:
-
-    Two problematic code points in Unicode casefolding of EXACT nodes:
-
-    U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
-    U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
-
-    which casefold to
-
-    Unicode                      UTF-8
-
-    U+03B9 U+0308 U+0301         0xCE 0xB9 0xCC 0x88 0xCC 0x81
-    U+03C5 U+0308 U+0301         0xCF 0x85 0xCC 0x88 0xCC 0x81
-
-    This means that in case-insensitive matching (or "loose matching",
-    as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
-    byte length of the above casefolded versions) can match a target
-    string of length two (the byte length of UTF-8 encoded U+0390 or
-    U+03B0). This would rather mess up the minimum length computation.
-
-    What we'll do is to look for the tail four bytes, and then peek
-    at the preceding two bytes to see whether we need to decrease
-    the minimum length by four (six minus two).
-
-    Thanks to the design of UTF-8, there cannot be false matches:
-    A sequence of valid UTF-8 bytes cannot be a subsequence of
-    another valid sequence of UTF-8 bytes.
-
-
 =head2 Base Structures
 
 The C<regexp> structure described in L<perlreapi> is common to all