summaryrefslogtreecommitdiff
path: root/pod/perlrecharclass.pod
diff options
context:
space:
mode:
authorRafael Garcia-Suarez <rgarciasuarez@gmail.com>2007-04-30 15:34:29 +0000
committerRafael Garcia-Suarez <rgarciasuarez@gmail.com>2007-04-30 15:34:29 +0000
commit8a11820602bf2132dec1f28575fd8f008fe20205 (patch)
tree459aeed5e6cb99ec5437c7d4996a52bad35ec632 /pod/perlrecharclass.pod
parent59fe32ea432417e2fb92353859d7fef999a5dea1 (diff)
downloadperl-8a11820602bf2132dec1f28575fd8f008fe20205.tar.gz
Two new manpages, by Abigail
p4raw-id: //depot/perl@31110
Diffstat (limited to 'pod/perlrecharclass.pod')
-rw-r--r--pod/perlrecharclass.pod525
1 files changed, 525 insertions, 0 deletions
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
new file mode 100644
index 0000000000..afdf11627a
--- /dev/null
+++ b/pod/perlrecharclass.pod
@@ -0,0 +1,525 @@
+=head1 NAME
+
+perlrecharclass - Perl Regular Expression Character Classes
+
+=head1 DESCRIPTION
+
+The top level documentation about Perl regular expressions
+is found in L<perlre>.
+
+This manual page discusses the syntax and use of character
+classes in Perl Regular Expressions.
+
+A character class is a way of denoting a set of characters,
+in such a way that one character of the set is matched.
+It's important to remember that matching a character class
+consumes exactly one character in the source string. (The source
+string is the string the regular expression is matched against.)
+
+There are three types of character classes in Perl regular
+expressions: the dot, backslashed sequences, and the bracketed form.
+
+=head2 The dot
+
+The dot (or period), C<.> is probably the most used, and certainly
+the most well-known character class. By default, a dot matches any
+character, except for the newline. The default can be changed to
+add matching the newline with the I<single line> modifier: either
+for the entire regular expression using the C</s> modifier, or
+locally using C<(?s)>.
+
+Here are some examples:
+
+ "a" =~ /./ # Match
+ "." =~ /./ # Match
+ "" =~ /./ # No match (dot has to match a character)
+ "\n" =~ /./ # No match (dot does not match a newline)
+ "\n" =~ /./s # Match (global 'single line' modifier)
+ "\n" =~ /(?s:.)/ # Match (local 'single line' modifier)
+ "ab" =~ /^.$/ # No match (dot matches one character)
+
+
+=head2 Backslashed sequences
+
+Perl regular expressions contain many backslashed sequences that
+constitute a character class. That is, they will match a single
+character, if that character belongs to a specific set of characters
+(defined by the sequence). A backslashed sequence is a sequence of
+characters starting with a backslash. Not all backslashed sequences
+are character class; for a full list, see L<perlrebackslash>.
+
+Here's a list of the backslashed sequences, which are discussed in
+more detail below.
+
+ \d Match a digit character.
+ \D Match a non-digit character.
+ \w Match a "word" character.
+ \W Match a non-"word" character.
+ \s Match a white space character.
+ \S Match a non-white space character.
+ \h Match a horizontal white space character.
+ \H Match a character that isn't horizontal white space.
+ \v Match a vertical white space character.
+ \V Match a character that isn't vertical white space.
+ \pP, \p{Prop} Match a character matching a Unicode property.
+ \PP, \P{Prop} Match a character that doesn't match a Unicode property.
+
+=head3 Digits
+
+C<\d> matches a single character that is considered to be a I<digit>.
+What is considered a digit depends on the internal encoding of
+the source string. If the source string is in UTF-8 format, C<\d>
+not only matches the digits '0' - '9', but also Arabic, Devanagari and
+digits from other languages. Otherwise, if there is a locale in effect,
+it will match whatever characters the locale considers digits. Without
+a locale, C<\d> matches the digits '0' to '9'.
+See L</Locale, Unicode and UTF-8>.
+
+Any character that isn't matched by C<\d> will be matched by C<\D>.
+
+=head3 Word characters
+
+C<\w> matches a single I<word> character: an alphanumeric character
+(that is, an alphabetic character, or a digit), or the underscore (C<_>).
+What is considered a word character depends on the internal encoding
+of the string. If it's in UTF-8 format, C<\w> matches those characters
+that are considered word characters in the Unicode database. That is, it
+not only matches ASCII letters, but also Thai letters, Greek letters, etc.
+If the source string isn't in UTF-8 format, C<\w> matches those characters
+that are considered word characters by the current locale. Without
+a locale in effect, C<\w> matches the ASCII letters, digits and the
+underscore.
+
+Any character that isn't matched by C<\w> will be matched by C<\W>.
+
+=head3 White space
+
+C<\s> matches any single character that is consider white space. In the
+ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
+(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
+space (the vertical tab, C<\cK> is not matched by C<\s>). The exact set
+of characters matched by C<\s> depends on whether the source string is
+in UTF-8 format. If it is, C<\s> matches what is considered white space
+in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
+matches whatever is considered white space by the current locale. Without
+a locale, C<\s> matches the five characters mentioned in the beginning
+of this paragraph. Perhaps the most notable difference is that C<\s>
+matches a non-breaking space only if the non-breaking space is in a
+UTF-8 encoded string.
+
+Any character that isn't matched by C<\s> will be matched by C<\S>.
+
+C<\h> will match any character that is considered horizontal white space;
+this includes the space and the tab characters. C<\H> will match any character
+that is not considered horizontal white space.
+
+C<\v> will match any character that is considered vertical white space;
+this includes the carriage return and line feed characters (newline).
+C<\V> will match any character that is not considered vertical white space.
+
+C<\R> matches anything that can be considered a newline under Unicode
+rules. It's not a character class, as it can match a multi-character
+sequence. Therefore, it cannot be used inside a bracketed character
+class. Details are discussed in L<perlrebackslash>.
+
+C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.
+
+Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
+the same characters, regardless whether the source string is in UTF-8
+format or not. The set of characters they match is also not influenced
+by locale.
+
+One might think that C<\s> is equivalent with C<[\h\v]>. This is not true.
+The vertical tab (C<"\x0b">) is not matched by C<\s>, it is however
+considered vertical white space. Furthermore, if the source string is
+not in UTF-8 format, the next line (C<"\x85">) and the no-break space
+(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
+If the source string is in UTF-8 format, both the next line and the
+no-break space are matched by C<\s>.
+
+The following table is a complete listing of characters matched by
+C<\s>, C<\h> and C<\v>.
+
+The first column gives the code point of the character (in hex format),
+the second column gives the (Unicode) name. The third column indicates
+by which class(es) the character is matched.
+
+ 0x00009 CHARACTER TABULATION h s
+ 0x0000a LINE FEED (LF) vs
+ 0x0000b LINE TABULATION v
+ 0x0000c FORM FEED (FF) vs
+ 0x0000d CARRIAGE RETURN (CR) vs
+ 0x00020 SPACE h s
+ 0x00085 NEXT LINE (NEL) vs [1]
+ 0x000a0 NO-BREAK SPACE h s [1]
+ 0x01680 OGHAM SPACE MARK h s
+ 0x0180e MONGOLIAN VOWEL SEPARATOR h s
+ 0x02000 EN QUAD h s
+ 0x02001 EM QUAD h s
+ 0x02002 EN SPACE h s
+ 0x02003 EM SPACE h s
+ 0x02004 THREE-PER-EM SPACE h s
+ 0x02005 FOUR-PER-EM SPACE h s
+ 0x02006 SIX-PER-EM SPACE h s
+ 0x02007 FIGURE SPACE h s
+ 0x02008 PUNCTUATION SPACE h s
+ 0x02009 THIN SPACE h s
+ 0x0200a HAIR SPACE h s
+ 0x02028 LINE SEPARATOR vs
+ 0x02029 PARAGRAPH SEPARATOR vs
+ 0x0202f NARROW NO-BREAK SPACE h s
+ 0x0205f MEDIUM MATHEMATICAL SPACE h s
+ 0x03000 IDEOGRAPHIC SPACE h s
+
+=over 4
+
+=item [1]
+
+NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
+UTF-8 format.
+
+=back
+
+It is worth noting that C<\d>, C<\w>, etc, match single characters, not
+complete numbers or words. To match a number (that consists of integers),
+use C<\d+>; to match a word, use C<\w+>.
+
+
+=head3 Unicode Properties
+
+C<\pP> and C<\p{Prop}> are character classes to match characters that
+fit given Unicode classes. One letter classes can be used in the C<\pP>
+form, with the class name following the C<\p>, otherwise, the property
+name is enclosed in braces, and follows the C<\p>. For instance, a
+match for a number can be written as C</\pN/> or as C</\p{Number}/>.
+Lowercase letters are matched by the property I<LowercaseLetter> which
+has as short form I<Ll>. They have to be written as C</\p{Ll}/> or
+C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
+It matches a two character string: a letter (Unicode property C<\pL>),
+followed by a lowercase C<l>.
+
+For a list of possible properties, see
+L<perlunicode/Unicode Character Properties>. It is also possible to
+defined your own properties. This is discussed in
+L<perlunicode/User-Defined Character Properties>.
+
+
+=head4 Examples
+
+ "a" =~ /\w/ # Match, "a" is a 'word' character.
+ "7" =~ /\w/ # Match, "7" is a 'word' character as well.
+ "a" =~ /\d/ # No match, "a" isn't a digit.
+ "7" =~ /\d/ # Match, "7" is a digit.
+ " " =~ /\s/ # Match, a space is white space.
+ "a" =~ /\D/ # Match, "a" is a non-digit.
+ "7" =~ /\D/ # No match, "7" is not a non-digit.
+ " " =~ /\S/ # No match, a space is not non-white space.
+
+ " " =~ /\h/ # Match, space is horizontal white space.
+ " " =~ /\v/ # No match, space is not vertical white space.
+ "\r" =~ /\v/ # Match, a return is vertical white space.
+
+ "a" =~ /\pL/ # Match, "a" is a letter.
+ "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters.
+
+ "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character
+ # 'THAI CHARACTER SO SO', and that's in
+ # Thai Unicode class.
+ "a" =~ /\P{Lao}/ # Match, as "a" is not a Laoian character.
+
+
+=head2 Bracketed Character Classes
+
+The third form of character class you can use in Perl regular expressions
+is the bracketed form. In its simplest form, it lists the characters
+that may be matched inside square brackets, like this: C<[aeiou]>.
+This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as the other
+character classes, exactly one character will be matched. To match
+a longer string consisting of characters mentioned in the characters
+class, follow the character class with a quantifier. For instance,
+C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.
+
+Repeating a character in a character class has no
+effect; it's considered to be in the set only once.
+
+Examples:
+
+ "e" =~ /[aeiou]/ # Match, as "e" is listed in the class.
+ "p" =~ /[aeiou]/ # No match, "p" is not listed in the class.
+ "ae" =~ /^[aeiou]$/ # No match, a character class only matches
+ # a single character.
+ "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier.
+
+=head3 Special Characters Inside a Bracketed Character Class
+
+Most characters that are meta characters in regular expressions (that
+is, characters that carry a special meaning like C<*> or C<(>) lose
+their special meaning and can be used inside a character class without
+the need to escape them. For instance, C<[()]> matches either an opening
+parenthesis, or a closing parenthesis, and the parens inside the character
+class don't group or capture.
+
+Characters that may carry a special meaning inside a character class are:
+C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
+escaped with a backslash, although this is sometimes not needed, in which
+case the backslash may be omitted.
+
+The sequence C<\b> is special inside a bracketed character class. While
+outside the character class C<\b> is an assertion indicating a point
+that does not have either two word characters or two non-word characters
+on either side, inside a bracketed character class, C<\b> matches a
+backspace character.
+
+A C<[> is not special inside a character class, unless it's the start
+of a POSIX character class (see below). It normally does not need escaping.
+
+A C<]> is either the end of a POSIX character class (see below), or it
+signals the end of the bracketed character class. Normally it needs
+escaping if you want to include a C<]> in the set of characters.
+However, if the C<]> is the I<first> (or the second if the first
+character is a caret) character of a bracketed character class, it
+does not denote the end of the class (as you cannot have an empty class)
+and is considered part of the set of characters that can be matched without
+escaping.
+
+Examples:
+
+ "+" =~ /[+?*]/ # Match, "+" in a character class is not special.
+ "\cH" =~ /[\b]/ # Match, \b inside in a character class
+ # is equivalent with a backspace.
+ "]" =~ /[][]/ # Match, as the character class contains.
+ # both [ and ].
+ "[]" =~ /[[]]/ # Match, the pattern contains a character class
+ # containing just ], and the character class is
+ # followed by a ].
+
+=head3 Character Ranges
+
+It is not uncommon to want to match a range of characters. Luckily, instead
+of listing all the characters in the range, one may use the hyphen (C<->).
+If inside a bracketed character class you have two characters separated
+by a hyphen, it's treated as if all the characters between the two are in
+the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
+matches any lowercase letter from the first half of the ASCII alphabet.
+
+Note that the two characters on either side of the hyphen are not
+necessary both letters or both digits. Any character is possible,
+although not advisable. C<['-?]> contains a range of characters, but
+most people will not know which characters that will be. Furthermore,
+such ranges may lead to portability problems if the code has to run on
+a platform that uses a different character set, such as EBCDIC.
+
+If a hyphen in a character class cannot be part of a range, for instance
+because it is the first or the last character of the character class,
+or if it immediately follows a range, the hyphen isn't special, and will be
+considered a character that may be matched. You have to escape the hyphen
+with a backslash if you want to have a hyphen in your set of characters to
+be matched, and its position in the class is such that it can be considered
+part of a range.
+
+Examples:
+
+ [a-z] # Matches a character that is a lower case ASCII letter.
+ [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or the
+ # letter 'z'.
+ [-z] # Matches either a hyphen ('-') or the letter 'z'.
+ [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the
+ # hyphen ('-'), or the letter 'm'.
+ ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
+ # (But not on an EBCDIC platform).
+
+
+=head3 Negation
+
+It is also possible to instead list the characters you do not want to
+match. You can do so by using a caret (C<^>) as the first character in the
+character class. For instance, C<[^a-z]> matches a character that is not a
+lowercase ASCII letter.
+
+This syntax make the caret a special character inside a bracketed character
+class, but only if it is the first character of the class. So if you want
+to have the caret as one of the characters you want to match, you either
+have to escape the caret, or not list it first.
+
+Examples:
+
+ "e" =~ /[^aeiou]/ # No match, the 'e' is listed.
+ "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel.
+ "^" =~ /[^^]/ # No match, matches anything that isn't a caret.
+ "^" =~ /[x^]/ # Match, caret is not special here.
+
+=head3 Backslash Sequences
+
+You can put a backslash sequence character class inside a bracketed character
+class, and it will act just as if you put all the characters matched by
+the backslash sequence inside the character class. For instance,
+C<[a-f\d]> will match any digit, or any of the lowercase letters between
+'a' and 'f' inclusive.
+
+Examples:
+
+ /[\p{Thai}\d]/ # Matches a character that is either a Thai
+ # character, or a digit.
+ /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic
+ # character, nor a parenthesis.
+
+Backslash sequence character classes cannot form one of the endpoints
+of a range.
+
+=head3 Posix Character Classes
+
+Posix character classes have the form C<[:class:]>, where I<class> is
+name, and the C<[:> and C<:]> delimiters. Posix character classes appear
+I<inside> bracketed character classes, and are a convenient and descriptive
+way of listing a group of characters. Be careful about the syntax,
+
+ # Correct:
+ $string =~ /[[:alpha:]]/
+
+ # Incorrect (will warn):
+ $string =~ /[:alpha:]/
+
+The latter pattern would be a character class consisting of a colon,
+and the letters C<a>, C<l>, C<p> and C<h>.
+
+Perl recognizes the following POSIX character classes:
+
+ alpha Any alphabetical character.
+ alnum Any alphanumerical character.
+ ascii Any ASCII character.
+ blank A GNU extension, equal to a space or a horizontal tab (C<\t>).
+ cntrl Any control character.
+ digit Any digit, equivalent to C<\d>.
+ graph Any printable character, excluding a space.
+ lower Any lowercase character.
+ print Any printable character, including a space.
+ punct Any punctuation character.
+ space Any white space character. C<\s> plus the vertical tab (C<\cK>).
+ upper Any uppercase character.
+ word Any "word" character, equivalent to C<\w>.
+ xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.
+
+The exact set of characters matched depends on whether the source string
+is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.
+
+Most POSIX character classes have C<\p> counterparts. The difference
+is that the C<\p> classes will always match according to the Unicode
+properties, regardless whether the string is in UTF-8 format or not.
+
+The following table shows the relation between POSIX character classes
+and the Unicode properties:
+
+ [[:...:]] \p{...} backslash
+
+ alpha IsAlpha
+ alnum IsAlnum
+ ascii IsASCII
+ blank
+ cntrl IsCntrl
+ digit IsDigit \d
+ graph IsGraph
+ lower IsLower
+ print IsPrint
+ punct IsPunct
+ space IsSpace
+ IsSpacePerl \s
+ upper IsUpper
+ word IsWord
+ xdigit IsXDigit
+
+Some character classes may have a non-obvious name:
+
+=over 4
+
+=item cntrl
+
+Any control character. Usually, control characters don't produce output
+as such, but instead control the terminal somehow: for example newline
+and backspace are control characters. All characters with C<ord()> less
+than 32 are usually classified as control characters (in ASCII, the ISO
+Latin character sets, and Unicode), as is the character C<ord()> value
+of 127 (C<DEL>).
+
+=item graph
+
+Any character that is I<graphical>, that is, visible. This class consists
+of all the alphanumerical characters and all punctuation characters.
+
+=item print
+
+All printable characters, which is the set of all the graphical characters
+plus the space.
+
+=item punct
+
+Any punctuation (special) character.
+
+=back
+
+=head4 Negation
+
+A Perl extension to the POSIX character class is the ability to
+negate it. This is done by prefixing the class name with a caret (C<^>).
+Some examples:
+
+ POSIX Unicode Backslash
+ [[:^digit:]] \P{IsDigit} \D
+ [[:^space:]] \P{IsSpace} \S
+ [[:^word:]] \P{IsWord} \W
+
+=head4 [= =] and [. .]
+
+Perl will recognize the POSIX character classes C<[=class=]>, and
+C<[.class.]>, but does not (yet?) support this construct. Use of
+such a constructs will lead to an error.
+
+
+=head4 Examples
+
+ /[[:digit:]]/ # Matches a character that is a digit.
+ /[01[:lower:]]/ # Matches a character that is either a
+ # lowercase letter, or '0' or '1'.
+ /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
+ # but the letters 'a' to 'f' in either case.
+ # This is because the character class contains
+ # all digits, and anything that isn't a
+ # hex digit, resulting in a class containing
+ # all characters, but the letters 'a' to 'f'
+ # and 'A' to 'F'.
+
+
+=head2 Locale, Unicode and UTF-8
+
+Some of the character classes have a somewhat different behaviour depending
+on the internal encoding of the source string, and the locale that is
+in effect.
+
+C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
+including C<\W>, C<\D>, C<\S>) suffer from this behaviour.
+
+The rule is that if the source string is in UTF-8 format, the character
+classes match according to the Unicode properties. If the source string
+isn't, then the character classes match according to whatever locale is
+in effect. If there is no locale, they match the ASCII defaults
+(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc).
+
+This usually means that if you are matching against characters whose C<ord()>
+values are between 128 and 255 inclusive, your character class may match
+or not depending on the current locale, and whether the source string is
+in UTF-8 format. The string will be in UTF-8 format if it contains
+characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
+format without it having such characters.
+
+For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
+or the POSIX character classes, and use the Unicode properties instead.
+
+=head4 Examples
+
+ $str = "\xDF"; # $str is not in UTF-8 format.
+ $str =~ /^\w/; # No match, as $str isn't in UTF-8 format.
+ $str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
+ $str =~ /^\w/; # Match! $str is now in UTF-8 format.
+ chop $str;
+ $str =~ /^\w/; # Still a match! $str remains in UTF-8 format.
+
+=cut