author | Karl Williamson <khw@cpan.org> | 2016-01-18 14:25:02 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-01-19 15:08:59 -0700 |
commit | 6b659339f976d014a1a53731d86cedd01f5921ec (patch) | |
tree | 852b02830e8b19bf700af95791214485d1f4e2e8 /pod/perlrebackslash.pod | |
parent | ca8226cfa2cc0ddcc50f60505c42078df8e3b766 (diff) | |
Add qr/\b{lb}/

This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.

This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.

The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
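As a rough illustration of the table-plus-context approach described above, here is a toy sketch in Perl. It is not the code added by this commit, which lives in the C regex engine. The five character classes, the table contents, and the helpers `lb_class` and `break_allowed` are all hypothetical simplifications, loosely modeled on a few UAX #14 rules (for example, no break after an opening parenthesis, even across spaces).

```perl
use strict;
use warnings;

# Toy classification (hypothetical): the real algorithm has many more classes.
sub lb_class {
    my ($ch) = @_;
    return 'SP' if $ch eq ' ';
    return 'OP' if $ch eq '(';
    return 'CL' if $ch eq ')';
    return 'NU' if $ch =~ /\A[0-9]\z/;
    return 'AL';    # everything else is treated as alphabetic here
}

# Pair table: $no_break{BEFORE}{AFTER} is true when a break is prohibited
# between a character of class BEFORE and one of class AFTER.
my %no_break = (
    AL => { AL => 1, NU => 1, SP => 1, CL => 1 },
    NU => { AL => 1, NU => 1, SP => 1, CL => 1 },
    OP => { AL => 1, NU => 1, SP => 1, CL => 1, OP => 1 },
    CL => { SP => 1, CL => 1 },
    SP => { SP => 1, CL => 1 },
);

# Is a break allowed at offset $pos, i.e. between the characters at
# $pos - 1 and $pos?
sub break_allowed {
    my ($text, $pos) = @_;
    return 0 if $pos <= 0;                 # never at the start of the text
    return 1 if $pos >= length($text);     # always at the end of the text

    my $before = lb_class(substr($text, $pos - 1, 1));
    my $after  = lb_class(substr($text, $pos, 1));
    return 0 if $no_break{$before}{$after};    # the table decides most cases

    # Context case the pair table cannot express: after a run of spaces,
    # look back at what preceded the spaces. No break if it was an opener.
    if ($before eq 'SP') {
        my $i = $pos - 1;
        $i-- while $i >= 0 && substr($text, $i, 1) eq ' ';
        return 0 if $i >= 0 && lb_class(substr($text, $i, 1)) eq 'OP';
    }
    return 1;
}

my $text   = "ab (cd) 12";
my @breaks = grep { break_allowed($text, $_) } 1 .. length($text);
print "break opportunities at offsets: @breaks\n";
```

The point of the split is that the table answers most questions from the two adjacent characters alone, and only the few context-dependent cases need extra code.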

This should meet the needs for line breaking of many applications,
without having to load the module.

The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
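
From the user's side, the new boundary type can be exercised roughly as follows. This is a minimal sketch that assumes a perl from the 5.24 series or later. The helper `wrap_at_line_breaks` is hypothetical, measures width naively in code points, and does not treat mandatory breaks (such as newlines) differently from optional ones.

```perl
use 5.024;      # assumption: \b{lb} needs perl 5.24 or later
use warnings;

# Hypothetical helper: greedily pack the chunks between break opportunities
# into lines of at most $width characters.
sub wrap_at_line_breaks {
    my ($text, $width) = @_;

    # Zero-width split: every chunk ends where a line break is permitted.
    my @chunks = split /\b{lb}/, $text;

    my @lines;
    my $current = '';
    for my $chunk (@chunks) {
        # Trailing whitespace is not counted against the width.
        (my $visible = $current . $chunk) =~ s/\s+\z//;
        if (length($current) && length($visible) > $width) {
            push @lines, $current;
            $current = $chunk;
        }
        else {
            $current .= $chunk;
        }
    }
    push @lines, $current if length $current;

    s/\s+\z// for @lines;    # strip the whitespace left at line ends
    return @lines;
}

say for wrap_at_line_breaks("The quick brown fox jumps over the lazy dog", 15);
```

Applications that need tailoring or width-aware handling of East Asian text are probably still better served by Unicode::LineBreak, as the commit message notes.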
Diffstat (limited to 'pod/perlrebackslash.pod')
-rw-r--r-- | pod/perlrebackslash.pod | 22 |
1 files changed, 16 insertions, 6 deletions
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 616aa447c8..f27da1fc3c 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -529,7 +529,7 @@ Mnemonic: I<G>lobal.
 C<\b{...}>, available starting in v5.22, matches a boundary (between two
 characters, or before the first character of the string, or after the
 final character of the string) based on the Unicode rules for the
-boundary type specified inside the braces. The currently known boundary
+boundary type specified inside the braces. The boundary
 types are given a few paragraphs below. C<\B{...}> matches at any place
 between characters where C<\b{...}> of the same type doesn't match.
 
@@ -551,7 +551,7 @@ the non-word "=", there must be a word character immediately previous.
 All plain C<\b> and C<\B> boundary determinations look for word
 characters alone, not for
 non-word characters nor for string ends. It may help to understand how
-<\b> and <\B> work by equating them as follows:
+C<\b> and C<\B> work by equating them as follows:
 
     \b really means (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
     \B really means (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
@@ -559,8 +559,9 @@ non-word characters nor for string ends. It may help to understand how
 In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
 beginning and end of the line, depending on the boundary type. These
 implement the Unicode default boundaries, specified in
+L<http://www.unicode.org/reports/tr14/> and
 L<http://www.unicode.org/reports/tr29/>.
-The boundary types currently available are:
+The boundary types are:
 
 =over
 
@@ -572,6 +573,18 @@ explained below under L</C<\X>>. In fact, C<\X> is another way to get the same
 functionality. It is equivalent to C</.+?\b{gcb}/>. Use whichever is
 most convenient for your situation.
 
+=item C<\b{lb}>
+
+This matches according to the default Unicode Line Breaking Algorithm
+(L<http://www.unicode.org/reports/tr14/>), as customized in that
+document
+(L<Example 7 of revision 35|http://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
+for better handling of numeric expressions.
+
+This is suitable for many purposes, but the L<Unicode::LineBreak> module
+is available on CPAN that provides many more features, including
+customization.
+
 =item C<\b{sb}>
 
 This matches a Unicode "Sentence Boundary". This is an aid to parsing
@@ -640,9 +653,6 @@ particular purposes and locales. For example, some languages, such as
 Japanese and Thai, require dictionary lookup to determine word
 boundaries.
 
-Unicode defines a fourth boundary type, accessible through the
-L<Unicode::LineBreak> module.
-
 Mnemonic: I<b>oundary.
 
 =back