author Karl Williamson <public@khwilliamson.com> 2013-09-12 19:42:51 -0600
committer Karl Williamson <public@khwilliamson.com> 2013-09-24 11:36:11 -0600
commit c8849eb1c2c14b1c9a128a8f8a696ae1eac43f63 (patch)
tree 90c3275ec01c423695636c0b5883dde3eeee3567 /pod/perlreguts.pod
parent bd650281b855f8970df792fc7c58cffd5bfe272e (diff)
perlreguts: Bring up-to-date
Various changes have been made to regcomp.c that didn't make it into perlreguts until now.
Diffstat (limited to 'pod/perlreguts.pod')
-rw-r--r-- pod/perlreguts.pod 48
1 files changed, 12 insertions, 36 deletions
diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod
index 039b48c82d..d93c799ce6 100644
--- a/pod/perlreguts.pod
+++ b/pod/perlreguts.pod
@@ -168,23 +168,29 @@ multiple of four bytes:
=item C<regnode_charclass>
-Bracketed character classes are represented by C<regnode_charclass> structures,
-which have a four-byte argument and then a 32-byte (256-bit) bitmap
-indicating which characters are included in the class.
+Bracketed character classes are represented by C<regnode_charclass>
+structures, which have a four-byte argument and then a 32-byte (256-bit)
+bitmap indicating which characters in the Latin1 range are included in
+the class.
regnode_charclass U32 arg1;
char bitmap[ANYOF_BITMAP_SIZE];
+Various flags whose names begin with C<ANYOF_> are used for special
+situations. Above Latin1 matches and things not known until run-time
+are stored in L</Perl's pprivate structure>.
+
=item C<regnode_charclass_class>
There is also a larger form of a char class structure used to represent
-POSIX char classes called C<regnode_charclass_class> which has an
-additional 4-byte (32-bit) bitmap indicating which POSIX char classes
+POSIX char classes under C</l> matching,
+called C<regnode_charclass_class> which has an
+additional 32-bit bitmap indicating which POSIX char classes
have been included.
regnode_charclass_class U32 arg1;
char bitmap[ANYOF_BITMAP_SIZE];
- char classflags[ANYOF_CLASSBITMAP_SIZE];
+ U32 classflags;
=back
@@ -761,36 +767,6 @@ Care must be taken when making changes to make sure that you handle
UTF-8 properly, both at compile time and at execution time, including
when the string and pattern are mismatched.
-The following comment in F<regcomp.h> gives an example of exactly how
-tricky this can be:
-
- Two problematic code points in Unicode casefolding of EXACT nodes:
-
- U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
- U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
-
- which casefold to
-
- Unicode UTF-8
-
- U+03B9 U+0308 U+0301 0xCE 0xB9 0xCC 0x88 0xCC 0x81
- U+03C5 U+0308 U+0301 0xCF 0x85 0xCC 0x88 0xCC 0x81
-
- This means that in case-insensitive matching (or "loose matching",
- as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
- byte length of the above casefolded versions) can match a target
- string of length two (the byte length of UTF-8 encoded U+0390 or
- U+03B0). This would rather mess up the minimum length computation.
-
- What we'll do is to look for the tail four bytes, and then peek
- at the preceding two bytes to see whether we need to decrease
- the minimum length by four (six minus two).
-
- Thanks to the design of UTF-8, there cannot be false matches:
- A sequence of valid UTF-8 bytes cannot be a subsequence of
- another valid sequence of UTF-8 bytes.
-
-
=head2 Base Structures
The C<regexp> structure described in L<perlreapi> is common to all