Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 95
1 files changed, 62 insertions, 33 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 7558b3260b..cfe44f6a22 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -985,34 +985,33 @@ Level 1 - Basic Unicode Support
RL1.6 Line Boundaries - MISSING [8][9]
RL1.7 Supplementary Code Points - done [10]
- [1] \x{...}
- [2] \p{...} \P{...}
- [3] supports not only minimal list, but all Unicode character
- properties (see Unicode Character Properties above)
- [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
- [5] can use regular expression look-ahead [a] or
- user-defined character properties [b] to emulate set
- operations
- [6] \b \B
- [7] note that Perl does Full case-folding in matching (but with
- bugs), not Simple: for example U+1F88 is equivalent to
- U+1F00 U+03B9, instead of just U+1F80. This difference
- matters mainly for certain Greek capital letters with certain
- modifiers: the Full case-folding decomposes the letter,
- while the Simple case-folding would map it to a single
- character.
- [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR
- (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
- (U+2029); should also affect <>, $., and script line
- numbers; should not split lines within CRLF [c] (i.e. there
- is no empty line between \r and \n)
- [9] Linebreaking conformant with UAX#14 "Unicode Line Breaking
- Algorithm" is available through the Unicode::LineBreak
- module.
- [10] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
- U+10FFFF but also beyond U+10FFFF
-
-[a] You can mimic class subtraction using lookahead.
+=over 4
+
+=item [1]
+
+\x{...}
+
+=item [2]
+
+\p{...} \P{...}
+
+=item [3]
+
+supports not only minimal list, but all Unicode character properties (see Unicode Character Properties above)
+
+=item [4]
+
+\d \D \s \S \w \W \X [:prop:] [:^prop:]
+
+=item [5]
+
+Can use the following to emulate set operations:
+
+=over 4
+
+=item * Regular expression look-ahead
+
+You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as
[{Block=Greek}-[{UNASSIGNED}]]
@@ -1028,13 +1027,43 @@ But in this particular example, you probably really want
which will match assigned characters known to be part of the Greek script.
-Also see the L<Unicode::Regex::Set> module; it does implement the full
-UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
+=item * CPAN module L<Unicode::Regex::Set>
-[b] '+' for union, '-' for removal (set-difference), '&' for intersection
-(see L</"User-Defined Character Properties">)
+It does implement the full UTS#18 grouping, intersection, union, and
+removal (subtraction) syntax.
-[c] Try the C<:crlf> layer (see L<PerlIO>).
+=item * L</"User-Defined Character Properties">
+
+'+' for union, '-' for removal (set-difference), '&' for intersection
+
+=back
+
+=item [6]
+
+\b \B
+
+=item [7]
+
+Note that Perl does Full case-folding in matching (but with bugs), not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9, instead of just U+1F80. This difference matters mainly for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character.
+
+=item [8]
+
+Should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), CRLF
+(\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); should also affect
+<>, $., and script line numbers; should not split lines within CRLF
+(i.e. there is no empty line between \r and \n). For CRLF, try the
+C<:crlf> layer (see L<PerlIO>).
+
+=item [9]
+
+Linebreaking conformant with UAX#14 "Unicode Line Breaking Algorithm" is available through the Unicode::LineBreak module.
+
+=item [10]
+
+UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to
+U+10FFFF but also beyond U+10FFFF.
+
+=back
=item *