author | Jarkko Hietaniemi <jhi@iki.fi> | 2001-05-04 03:22:59 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2001-05-04 03:22:59 +0000 |
commit | 773ee785b5f3b7ef0dc8377e0c21172f01981c54 (patch) | |
tree | 216ecf32800727a3584838e0d14787a786415aa7 /pod/perlretut.pod | |
parent | b4ccb3669c4ba4f143dd8c8c0db67436d390acba (diff) | |
Document the \pX and \p{Yz} (and \p{BidiXYZ}) classes a bit more.
p4raw-id: //depot/perl@9983
Diffstat (limited to 'pod/perlretut.pod')
-rw-r--r-- | pod/perlretut.pod | 70 |
1 files changed, 69 insertions, 1 deletions
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index fa6479c0c4..2960950065 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1745,7 +1745,75 @@ You can also use the official Unicode class names with the C<\p> and
 C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
 letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
 letter, the braces can be dropped. For instance, C<\pM> is the
-character class of Unicode 'marks'.
+character class of Unicode 'marks', for example accent marks.
+Here is the list as of Unicode 3.1.0 (the two-letter classes) and
+Perl 5.8.0 (the one-letter classes):
+
+    L  Letter
+    Lu Letter, Uppercase
+    Ll Letter, Lowercase
+    Lt Letter, Titlecase
+    Lm Letter, Modifier
+    Lo Letter, Other
+    M  Mark
+    Mn Mark, Non-Spacing
+    Mc Mark, Spacing Combining
+    Me Mark, Enclosing
+    N  Number
+    Nd Number, Decimal Digit
+    Nl Number, Letter
+    No Number, Other
+    P  Punctuation
+    Pc Punctuation, Connector
+    Pd Punctuation, Dash
+    Ps Punctuation, Open
+    Pe Punctuation, Close
+    Pi Punctuation, Initial quote
+       (may behave like Ps or Pe depending on usage)
+    Pf Punctuation, Final quote
+       (may behave like Ps or Pe depending on usage)
+    Po Punctuation, Other
+    S  Symbol
+    Sm Symbol, Math
+    Sc Symbol, Currency
+    Sk Symbol, Modifier
+    So Symbol, Other
+    Z  Separator
+    Zs Separator, Space
+    Zl Separator, Line
+    Zp Separator, Paragraph
+    C  Other
+    Cc Other, Control
+    Cf Other, Format
+    Cs Other, Surrogate
+    Co Other, Private Use
+    Cn Other, Not Assigned (Unicode defines no Cn characters)
+
+Additionally, because scripts differ in their directionality
+(for example Hebrew is written right to left), all characters
+have their directionality defined:
+
+    BidiL   Left-to-Right
+    BidiLRE Left-to-Right Embedding
+    BidiLRO Left-to-Right Override
+    BidiR   Right-to-Left
+    BidiAL  Right-to-Left Arabic
+    BidiRLE Right-to-Left Embedding
+    BidiRLO Right-to-Left Override
+    BidiPDF Pop Directional Format
+    BidiEN  European Number
+    BidiES  European Number Separator
+    BidiET  European Number Terminator
+    BidiAN  Arabic Number
+    BidiCS  Common Number Separator
+    BidiNSM Non-Spacing Mark
+    BidiBN  Boundary Neutral
+    BidiB   Paragraph Separator
+    BidiS   Segment Separator
+    BidiWS  Whitespace
+    BidiON  Other Neutrals
+
+For the the full and latest information see the latest Unicode standard.
 
 C<\X> is an abbreviation for a character class sequence that includes
 the Unicode 'combining character sequences'. A 'combining character