Tailor \b{wb} for Perl

The Unicode \b{wb} matches at the boundary between adjacent space characters within a span of them. This is the opposite of what plain \b does, and is counterintuitive to Perl expectations. This commit tailors \b{wb} so that it does not split up spans of white space. I have submitted a request to Unicode to re-examine their algorithm, and it has been assigned to a subcommittee, but the result won't be available until after 5.24 is done. In any event, Unicode encourages tailoring for local conditions.
author: Karl Williamson <khw@cpan.org> 2016-01-05 16:12:55 -0700
committer: Karl Williamson <khw@cpan.org> 2016-01-08 14:17:11 -0700
commit: f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205 (patch)
tree: 52365bdb2759341217eb979be04a61f5b351eb2f /pod/perlrebackslash.pod
parent: cbdbe9d466e0d26852ca1ace0825220c8ca7d215 (diff)
download: perl-f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205.tar.gz
1 files changed, 18 insertions, 1 deletions
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index e0e4661158..616aa447c8 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -592,12 +592,29 @@ future Perl versions.
 
 =item C<\b{wb}>
 
-This matches a Unicode "Word Boundary".  This gives better (though not
+This matches a Unicode "Word Boundary", but tailored to Perl
+expectations.  This gives better (though not
 perfect) results for natural language processing than plain C<\b>
 (without braces) does.  For example, it understands that apostrophes can
 be in the middle of words and that parentheses aren't (see the examples
 below).  More details are at L<http://www.unicode.org/reports/tr29/>.
 
+The current Unicode definition of a Word Boundary matches between every
+white space character.  Perl tailors this, starting in version 5.24, to
+generally not break up spans of white space, just as plain C<\b> has
+always functioned.  This allows C<\b{wb}> to be a drop-in replacement for
+C<\b>, but with generally better results for natural language
+processing.  (The exception to this tailoring is when a span of white
+space is immediately followed by something like U+0303, COMBINING TILDE.
+If the final space character in the span is a horizontal white space, it
+is broken out so that it attaches instead to the combining character.
+To be precise, if a span of white space that ends in a horizontal space
+has the character immediately following it have either of the Word
+Boundary property values "Extend" or "Format", the boundary between the
+final horizontal space character and the rest of the span matches
+C<\b{wb}>.  In all other cases the boundary between two white space
+characters matches C<\B{wb}>.)
+
 =back
 
 It is important to realize when you use these Unicode boundaries,
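
The effect of the tailoring described above can be illustrated with a small sketch. It assumes Perl 5.24 or later; the helper name wb_pieces is hypothetical and exists only for this illustration. It splits a string at every position where \b{wb} matches, so a run of white space that is not followed by an Extend or Format character should come back as a single piece, while the apostrophe stays inside its word.

```perl
use strict;
use warnings;
use v5.24;    # the tailored \b{wb} described in this commit starts in 5.24

# Hypothetical helper (not part of Perl): return the pieces of $text
# obtained by splitting at every position where \b{wb} matches.
sub wb_pieces {
    my ($text) = @_;
    return split /\b{wb}/, $text;
}

# Two spaces between the words: with the tailoring, the span of white
# space is not broken up, so it is returned as one piece.
my @pieces = wb_pieces("Don't  stop");
say join '|', map { "[$_]" } @pieces;
```

Under the untailored Unicode rules this commit refers to, the two spaces would instead be reported as two separate one-character pieces.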