author | Jarkko Hietaniemi <jhi@iki.fi> | 2002-04-20 01:46:03 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2002-04-20 01:46:03 +0000 |
commit | 491fd90a109f6263a896300e5709e6fd255f075f (patch) | |
tree | 596ceeddf227da61927d12e4c2ce4c324fc43bbd /pod/perlunicode.pod | |
parent | ee081dd1f02934d943364e5d6bd4130bf9c3e0ad (diff) | |
download | perl-491fd90a109f6263a896300e5709e6fd255f075f.tar.gz |
User-defined character properties were unintentionally
removed, noticed by Dan Kogai.
p4raw-id: //depot/perl@16012
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 80 |
1 files changed, 80 insertions, 0 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index af79344402..46080430a7 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -615,6 +615,86 @@ And finally, C<scalar reverse()> reverses by character rather than by byte.
 
 =back
 
+=head2 Defining your own character properties
+
+You can define your own character properties by defining subroutines
+that have names beginning with "In" or "Is". The subroutines must be
+visible in the package that uses the properties. The user-defined
+properties can be used in the regular expression C<\p> and C<\P>
+constructs.
+
+The subroutines must return a specially formatted string: one or more
+newline-separated lines. Each line must be one of the following:
+
+=over 4
+
+=item *
+
+Two hexadecimal numbers separated by a tabulator denoting a range
+of Unicode codepoints.
+
+=item *
+
+An existing character property prefixed by "+utf8::" to include
+all the characters in that property.
+
+=item *
+
+An existing character property prefixed by "-utf8::" to exclude
+all the characters in that property.
+
+=item *
+
+An existing character property prefixed by "!utf8::" to include
+all except the characters in that property.
+
+=back
+
+For example, to define a property that covers both the Japanese
+syllabaries (hiragana and katakana), you can define
+
+    sub InKana {
+        return <<'END';
+    3040	309F
+    30A0	30FF
+    END
+    }
+
+Imagine that the here-doc end marker is at the beginning of the line,
+and that the hexadecimal numbers are separated by a tabulator.
+Now you can use C<\p{InKana}> and C<\P{IsKana}>.
+
+You could also have used the existing block property names:
+
+    sub InKana {
+        return <<'END';
+    +utf8::InHiragana
+    +utf8::InKatakana
+    END
+    }
+
+Suppose you wanted to match only the allocated characters,
+not the by raw block ranges: in other words, you want to remove
+the non-characters:
+
+    sub InKana {
+        return <<'END';
+    +utf8::InHiragana
+    +utf8::InKatakana
+    -utf8::IsCn
+    END
+    }
+
+The negation is useful for defining (surprise!) negated classes.
+
+    sub InNotKana {
+        return <<'END';
+    !utf8::InHiragana
+    -utf8::InKatakana
+    +utf8::IsCn
+    END
+    }
+
 =head2 Character encodings for input and output
 
 See L<Encode>.
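
The restored section can be exercised with a small standalone script. What follows is a minimal sketch, not part of the commit. The names `InAssignedKana` and `classify` are hypothetical and exist only for illustration. The tab separator is written as `"\t"` instead of using a here-doc, so that it cannot be lost to whitespace conversion. The patch text writes `\P{IsKana}`; the sketch uses `\P{InKana}`, on the assumption that the negated form has to name the subroutine that is actually defined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Range form: each line is "START<TAB>END" in hexadecimal.
# "\t" is spelled out so the tab cannot be silently turned into spaces.
sub InKana {
    return "3040\t309F\n"     # Hiragana block
         . "30A0\t30FF\n";    # Katakana block
}

# Composition form: start from the two existing block properties,
# then subtract unassigned code points (general category Cn).
# InAssignedKana is a hypothetical name chosen for this sketch.
sub InAssignedKana {
    return "+utf8::InHiragana\n"
         . "+utf8::InKatakana\n"
         . "-utf8::IsCn\n";
}

# Hypothetical helper: test one character against both properties
# and against the negated \P form.
sub classify {
    my ($char) = @_;
    return {
        kana          => ($char =~ /\A\p{InKana}\z/         ? 1 : 0),
        assigned_kana => ($char =~ /\A\p{InAssignedKana}\z/ ? 1 : 0),
        not_kana      => ($char =~ /\A\P{InKana}\z/         ? 1 : 0),
    };
}

# U+3042 HIRAGANA LETTER A, U+30A2 KATAKANA LETTER A,
# U+3040 (inside the Hiragana block but unassigned), U+0041 LATIN CAPITAL A.
for my $cp (0x3042, 0x30A2, 0x3040, 0x0041) {
    my $r = classify(chr $cp);
    printf "U+%04X kana=%d assigned_kana=%d not_kana=%d\n",
        $cp, @{$r}{qw(kana assigned_kana not_kana)};
}
```

U+3040 is the interesting case: it lies inside the raw block range, so it is expected to match `\p{InKana}`, but it is an unassigned code point, so the `-utf8::IsCn` line should remove it from `\p{InAssignedKana}`. Both subroutines live in package `main`, the same package as the regular expressions that use them, which is the visibility requirement the patch describes.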