Add qr/\b{lb}/

This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module.

This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm.

The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those.

This should meet the needs for line breaking of many applications, without having to load the module.

The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
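
For illustration only (this is not part of the commit), here is a minimal sketch of how the new boundary might be used from Perl code. It assumes a perl that includes this change (5.24 or later). The helper names lb_chunks and wrap_text are hypothetical, and the greedy wrapper is a deliberately naive demonstration: it counts trailing spaces toward the width and does not treat mandatory breaks specially.

```perl
use strict;
use warnings;
use v5.24;    # assumption: \b{lb} is available from this release onward

# Hypothetical helper: split text into the smallest pieces between
# which a line break is allowed.  Each piece ends at a position where
# \b{lb} matches, so any trailing spaces stay with the preceding word.
sub lb_chunks {
    my ($text) = @_;
    my @chunks = $text =~ /(.+?\b{lb})/sg;
    return @chunks;
}

# Hypothetical helper: naive greedy wrapping built on lb_chunks().
# A new line is started whenever adding the next chunk would exceed
# $width characters (trailing spaces are counted, for simplicity).
sub wrap_text {
    my ($text, $width) = @_;
    my @lines = ('');
    for my $chunk (lb_chunks($text)) {
        if (length($lines[-1]) && length($lines[-1] . $chunk) > $width) {
            push @lines, '';
        }
        $lines[-1] .= $chunk;
    }
    return @lines;
}

# Print each break-delimited chunk of a short string in brackets.
print "[$_]\n" for lb_chunks("Hello world");
```

Applications that need tailoring (for example, language-specific rules) would still reach for Unicode::LineBreak, as the message above notes.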
author: Karl Williamson <khw@cpan.org> 2016-01-18 14:25:02 -0700
committer: Karl Williamson <khw@cpan.org> 2016-01-19 15:08:59 -0700
commit: 6b659339f976d014a1a53731d86cedd01f5921ec (patch)
tree: 852b02830e8b19bf700af95791214485d1f4e2e8 /pod
parent: ca8226cfa2cc0ddcc50f60505c42078df8e3b766 (diff)
download: perl-6b659339f976d014a1a53731d86cedd01f5921ec.tar.gz
3 files changed, 31 insertions, 8 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 8aa2456570..0a697d3ed5 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -27,6 +27,15 @@ here, but most should go in the L</Performance Enhancements> section.
 
 [ List each enhancement as a =head2 entry ]
 
+=head2 New C<\b{lb}> boundary in regular expressions
+
+C<lb> stands for Line Break.  It is a Unicode property
+that determines where a line of text is suitable to break (typically so
+that it can be output without overflowing the available horizontal
+space).  This capability has long been furnished by the
+L<Unicode::LineBreak> module, but now a light-weight, non-customizable
+version that is suitable for many purposes is in core Perl.
+
 =head1 Security
 
 =head2 fix out of boundary access in Win32 path handling
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 616aa447c8..f27da1fc3c 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -529,7 +529,7 @@ Mnemonic: I<G>lobal.
 C<\b{...}>, available starting in v5.22, matches a boundary (between two
 characters, or before the first character of the string, or after the
 final character of the string) based on the Unicode rules for the
-boundary type specified inside the braces.  The currently known boundary
+boundary type specified inside the braces.  The boundary
 types are given a few paragraphs below.  C<\B{...}> matches at any place
 between characters where C<\b{...}> of the same type doesn't match.
 
@@ -551,7 +551,7 @@ the non-word "=", there must be a word character immediately previous.
 All plain C<\b> and C<\B> boundary determinations look for word
 characters alone, not for
 non-word characters nor for string ends.  It may help to understand how
-<\b> and <\B> work by equating them as follows:
+C<\b> and C<\B> work by equating them as follows:
 
     \b	really means	(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
     \B	really means	(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
@@ -559,8 +559,9 @@ non-word characters nor for string ends.  It may help to understand how
 In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
 beginning and end of the line, depending on the boundary type.  These
 implement the Unicode default boundaries, specified in
+L<http://www.unicode.org/reports/tr14/> and
 L<http://www.unicode.org/reports/tr29/>.
-The boundary types currently available are:
+The boundary types are:
 
 =over
 
@@ -572,6 +573,18 @@ explained below under L</C<\X>>.  In fact, C<\X> is another way to get
 the same functionality.  It is equivalent to C</.+?\b{gcb}/>.  Use
 whichever is most convenient for your situation.
 
+=item C<\b{lb}>
+
+This matches according to the default Unicode Line Breaking Algorithm
+(L<http://www.unicode.org/reports/tr14/>), as customized in that
+document
+(L<Example 7 of revision 35|http://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
+for better handling of numeric expressions.
+
+This is suitable for many purposes, but the L<Unicode::LineBreak> module
+is available on CPAN that provides many more features, including
+customization.
+
 =item C<\b{sb}>
 
 This matches a Unicode "Sentence Boundary".  This is an aid to parsing
@@ -640,9 +653,6 @@ particular purposes and locales.  For example, some languages, such as
 Japanese and Thai, require dictionary lookup to determine word
 boundaries.
 
-Unicode defines a fourth boundary type, accessible through the
-L<Unicode::LineBreak> module.
-
 Mnemonic: I<b>oundary.
 
 =back
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index eb23d55cc1..775a4307a4 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1176,11 +1176,15 @@ Also, lines should not be split within C<CRLF> (i.e. there is no
 empty line between C<\r> and C<\n>).  For C<CRLF>, try the C<:crlf>
 layer (see L<PerlIO>).
 
-=item [9] But C<L<Unicode::LineBreak>> is available.
+=item [9] But C<qr/\b{lb}/> and C<L<Unicode::LineBreak>> are available.
 
-This module supplies line breaking conformant with
+L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> supplies default line
+breaking conformant with
 L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
 
+And, the module C<L<Unicode::LineBreak>> also conformant with UAX#14,
+provides customizable line breaking.
+
 =item [10]
 UTF-8/UTF-EBDDIC used in Perl allows not only C<U+10000> to
 C<U+10FFFF> but also beyond C<U+10FFFF>