author | Karl Williamson <public@khwilliamson.com> | 2013-01-10 17:06:04 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2013-01-11 11:50:38 -0700 |
commit | 9d1a5160ac870eccea399973eaa9f9e3020b0833 (patch) | |
tree | 9d08b87e7c229f41ee345da68b7de257a585a21b /pod/perlunicode.pod | |
parent | ab6629666cee2471e467421195a7a99662521188 (diff) | |
New regex experimental feature: (?[ ])
This is a fancier [bracketed] character class which allows set
operations, such as intersection and subtraction. The entry in perlre
for this commit details its operation.
Besides extending regular expressions to handle this functionality,
recommended by Unicode, the intent here is to do three things:
1) Intersection has been simulated by regexes using zero-width
look-around assertions, which are non-obvious. This allows replacing
those with a more powerful and clearer syntax; the compiled regexes
are smaller and faster. Everything is known at compile time.
2) Set operations have also been simulated by using user-defined Unicode
properties. These are globals, have security implications,
restricted names, and don't allow as complex expressions as this
new feature.
3) I hope that this feature will come to be viewed as a "better"
bracketed character class. I took advantage of the fact that there
is no embedded base to have to be compatible with to forbid certain
iffy practices with the existing ones, while remaining mostly
backwards compatible. The main difference is that /x is always
enabled, so white space can be pretty much freely used with these,
but to specify a match on white space, it must be escaped. Things
that should have been illegal now are, such as \x{} and \x{abcdefghi}.
Things that look like a posix specifier but don't quite meet the
rules now give an error instead of silently compiling; e.g., [:digit]
is an error instead of the union of the characters that compose it.
I may have omitted things; perhaps it should be an error to have the
same letter occur twice, adjacent. Since this is experimental, we
can make such changes based on field feedback.
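As a rough illustration of point 1 (not part of this commit), the sketch
below writes the same two sets first with look-around assertions and then
with (?[ ]). The variable names are made up for the example, and it assumes
perl v5.18 or later, where the experimental::regex_sets warning category
exists.

```perl
use v5.18;
use warnings;
no warnings 'experimental::regex_sets';   # assumption: running on v5.18+

# Subtraction: lowercase ASCII letters except vowels.
# Older idiom: a negative look-ahead in front of the wider class.
my $consonant_lookaround = qr/(?![aeiou])[a-z]/;
# Same set with the extended class; "-" is set subtraction.
my $consonant_set        = qr/(?[ [a-z] - [aeiou] ])/;

# Intersection: characters that are both Thai and decimal digits.
# Older idiom: a positive look-ahead in front of the second property.
my $thai_digit_lookaround = qr/(?=\p{Thai})\p{Digit}/;
# Same set with the extended class; "&" is set intersection.
my $thai_digit_set        = qr/(?[ \p{Thai} & \p{Digit} ])/;

# Both spellings pick the same letters out of 'a' .. 'g'.
say join '', grep { $_ =~ $consonant_lookaround } 'a' .. 'g';   # bcdfg
say join '', grep { $_ =~ $consonant_set }        'a' .. 'g';   # bcdfg
```

White space inside (?[ ]) is ignored, which is why the operands and
operators above can be spaced out freely.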
The intent is to keep this feature, since it is strongly recommended by
Unicode. The exact syntax is subject to change, so it is experimental.
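For comparison with point 2, here is a minimal sketch (again, not taken
from the commit) of one small set expressed both as a user-defined property
and as a (?[ ]) class. InKana follows the naming used in the perlunicode
examples; the code point ranges and variable names are chosen only for
illustration, and perl v5.18 or later is assumed.

```perl
use v5.18;
use warnings;
no warnings 'experimental::regex_sets';   # assumption: running on v5.18+

# Older approach: a user-defined property. The sub is a package global
# whose name must begin with "In" or "Is"; it returns one range per line,
# as two hex code points separated by a tab.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}
my $kana_property = qr/\p{InKana}/;

# The same union written inline; "+" is the union operator, and nothing
# needs to be defined outside the pattern itself.
my $kana_set = qr/(?[ [\x{3040}-\x{309F}] + [\x{30A0}-\x{30FF}] ])/;

# Report how each pattern treats a few sample code points.
for my $cp (0x3042, 0x30AB, 0x41) {
    printf "U+%04X property=%d set=%d\n", $cp,
        (chr($cp) =~ $kana_property ? 1 : 0),
        (chr($cp) =~ $kana_set      ? 1 : 0);
}
```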
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 10 |
1 files changed, 7 insertions, 3 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index cfe44f6a22..86db3ecfcb 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -808,7 +808,9 @@ L<perlrecharclass/POSIX Character Classes>.
 =head2 User-Defined Character Properties
 
 You can define your own binary character properties by defining subroutines
-whose names begin with "In" or "Is". The subroutines can be defined in any
+whose names begin with "In" or "Is". (The experimental feature
+L<perlre/(?[ ])> provides an alternative which allows more complex
+definitions.) The subroutines can be defined in any
 package. The user-defined properties can be used in the regular expression
 C<\p> and C<\P> constructs; if you are using a user-defined property from a
 package other than the one you are in, you must specify its package in the
@@ -979,7 +981,7 @@ Level 1 - Basic Unicode Support
  RL1.1 Hex Notation - done [1]
  RL1.2 Properties - done [2][3]
  RL1.2a Compatibility Properties - done [4]
- RL1.3 Subtraction and Intersection - MISSING [5]
+ RL1.3 Subtraction and Intersection - experimental [5]
  RL1.4 Simple Word Boundaries - done [6]
  RL1.5 Simple Loose Matches - done [7]
  RL1.6 Line Boundaries - MISSING [8][9]
@@ -1005,7 +1007,9 @@ supports not only minimal list, but all Unicode character properties (see Unicod
 
 =item [5]
 
- Can use the following to emulate set operations:
+The experimental feature in v5.18 "(?[...])" accomplishes this. See
+L<perlre/(?[ ])>. If you don't want to use an experimental feature,
+you can use one of the following:
 
 =over 4
 