author Jarkko Hietaniemi <jhi@iki.fi> 2001-05-04 03:22:59 +0000
committer Jarkko Hietaniemi <jhi@iki.fi> 2001-05-04 03:22:59 +0000
commit 773ee785b5f3b7ef0dc8377e0c21172f01981c54
tree 216ecf32800727a3584838e0d14787a786415aa7 /pod/perlretut.pod
parent b4ccb3669c4ba4f143dd8c8c0db67436d390acba
Document the \pX and \p{Yz} (and \p{BidiXYZ}) classes a bit more.
p4raw-id: //depot/perl@9983
Diffstat (limited to 'pod/perlretut.pod')
-rw-r--r-- pod/perlretut.pod 70
1 files changed, 69 insertions, 1 deletions
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index fa6479c0c4..2960950065 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1745,7 +1745,75 @@ You can also use the official Unicode class names with the C<\p> and
C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
letter, the braces can be dropped. For instance, C<\pM> is the
-character class of Unicode 'marks'.
+character class of Unicode 'marks', for example accent marks.
+Here is the list as of Unicode 3.1.0 (the two-letter classes) and
+Perl 5.8.0 (the one-letter classes):
+
+ L Letter
+ Lu Letter, Uppercase
+ Ll Letter, Lowercase
+ Lt Letter, Titlecase
+ Lm Letter, Modifier
+ Lo Letter, Other
+ M Mark
+ Mn Mark, Non-Spacing
+ Mc Mark, Spacing Combining
+ Me Mark, Enclosing
+ N Number
+ Nd Number, Decimal Digit
+ Nl Number, Letter
+ No Number, Other
+ P Punctuation
+ Pc Punctuation, Connector
+ Pd Punctuation, Dash
+ Ps Punctuation, Open
+ Pe Punctuation, Close
+ Pi Punctuation, Initial quote
+ (may behave like Ps or Pe depending on usage)
+ Pf Punctuation, Final quote
+ (may behave like Ps or Pe depending on usage)
+ Po Punctuation, Other
+ S Symbol
+ Sm Symbol, Math
+ Sc Symbol, Currency
+ Sk Symbol, Modifier
+ So Symbol, Other
+ Z Separator
+ Zs Separator, Space
+ Zl Separator, Line
+ Zp Separator, Paragraph
+ C Other
+ Cc Other, Control
+ Cf Other, Format
+ Cs Other, Surrogate
+ Co Other, Private Use
+ Cn Other, Not Assigned (Unicode defines no Cn characters)
+
+Additionally, because scripts differ in their directionality
+(for example Hebrew is written right to left), all characters
+have their directionality defined:
+
+ BidiL Left-to-Right
+ BidiLRE Left-to-Right Embedding
+ BidiLRO Left-to-Right Override
+ BidiR Right-to-Left
+ BidiAL Right-to-Left Arabic
+ BidiRLE Right-to-Left Embedding
+ BidiRLO Right-to-Left Override
+ BidiPDF Pop Directional Format
+ BidiEN European Number
+ BidiES European Number Separator
+ BidiET European Number Terminator
+ BidiAN Arabic Number
+ BidiCS Common Number Separator
+ BidiNSM Non-Spacing Mark
+ BidiBN Boundary Neutral
+ BidiB Paragraph Separator
+ BidiS Segment Separator
+ BidiWS Whitespace
+ BidiON Other Neutrals
+
+For the full and latest information see the latest Unicode standard.
C<\X> is an abbreviation for a character class sequence that includes
the Unicode 'combining character sequences'. A 'combining character