Diffstat (limited to 'pod')
-rw-r--r-- | pod/perldelta.pod       |  9
-rw-r--r-- | pod/perlrebackslash.pod | 22
-rw-r--r-- | pod/perlunicode.pod     |  8
3 files changed, 31 insertions, 8 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 8aa2456570..0a697d3ed5 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -27,6 +27,15 @@ here, but most should go in the L</Performance Enhancements> section.
 
 [ List each enhancement as a =head2 entry ]
 
+=head2 New C<\b{lb}> boundary in regular expressions
+
+C<lb> stands for Line Break. It is a Unicode property
+that determines where a line of text is suitable to break (typically so
+that it can be output without overflowing the available horizontal
+space). This capability has long been furnished by the
+L<Unicode::LineBreak> module, but now a light-weight, non-customizable
+version that is suitable for many purposes is in core Perl.
+
 =head1 Security
 
 =head2 fix out of boundary access in Win32 path handling
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 616aa447c8..f27da1fc3c 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -529,7 +529,7 @@ Mnemonic: I<G>lobal.
 C<\b{...}>, available starting in v5.22, matches a boundary (between two
 characters, or before the first character of the string, or after the
 final character of the string) based on the Unicode rules for the
-boundary type specified inside the braces. The currently known boundary
+boundary type specified inside the braces. The boundary
 types are given a few paragraphs below. C<\B{...}> matches at any place
 between characters where C<\b{...}> of the same type doesn't match.
 
@@ -551,7 +551,7 @@ the non-word "=", there must be a word character immediately previous.
 All plain C<\b> and C<\B> boundary determinations look for word
 characters alone, not for
 non-word characters nor for string ends. It may help to understand how
-<\b> and <\B> work by equating them as follows:
+C<\b> and C<\B> work by equating them as follows:
 
     \b really means (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
     \B really means (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
@@ -559,8 +559,9 @@ non-word characters nor for string ends. It may help to understand how
 In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
 beginning and end of the line, depending on the boundary type. These
 implement the Unicode default boundaries, specified in
+L<http://www.unicode.org/reports/tr14/> and
 L<http://www.unicode.org/reports/tr29/>.
-The boundary types currently available are:
+The boundary types are:
 
 =over
 
@@ -572,6 +573,18 @@ explained below under L</C<\X>>. In fact, C<\X> is another way to get
 the same functionality. It is equivalent to C</.+?\b{gcb}/>. Use
 whichever is most convenient for your situation.
 
+=item C<\b{lb}>
+
+This matches according to the default Unicode Line Breaking Algorithm
+(L<http://www.unicode.org/reports/tr14/>), as customized in that
+document
+(L<Example 7 of revision 35|http://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
+for better handling of numeric expressions.
+
+This is suitable for many purposes, but the L<Unicode::LineBreak> module
+is available on CPAN that provides many more features, including
+customization.
+
 =item C<\b{sb}>
 
 This matches a Unicode "Sentence Boundary". This is an aid to parsing
@@ -640,9 +653,6 @@ particular purposes and locales. For example, some languages, such as
 Japanese and Thai, require dictionary lookup to determine word
 boundaries.
 
-Unicode defines a fourth boundary type, accessible through the
-L<Unicode::LineBreak> module.
-
 Mnemonic: I<b>oundary.
 
 =back
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index eb23d55cc1..775a4307a4 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1176,11 +1176,15 @@ Also, lines should not be split within C<CRLF> (i.e. there is no
 empty line between C<\r> and C<\n>). For C<CRLF>, try the C<:crlf>
 layer (see L<PerlIO>).
 
-=item [9] But C<L<Unicode::LineBreak>> is available.
+=item [9] But C<qr/\b{lb}/> and C<L<Unicode::LineBreak>> are available.
 
-This module supplies line breaking conformant with
+L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> supplies default line
+breaking conformant with
 L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
 
+And, the module C<L<Unicode::LineBreak>> also conformant with UAX#14,
+provides customizable line breaking.
+
 =item [10] UTF-8/UTF-EBDDIC used in Perl allows not only C<U+10000> to
 C<U+10FFFF> but also beyond C<U+10FFFF>