author | Karl Williamson <khw@cpan.org> | 2016-01-05 16:12:55 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-01-08 14:17:11 -0700 |
commit | f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205 | |
tree | 52365bdb2759341217eb979be04a61f5b351eb2f /pod/perlrebackslash.pod | |
parent | cbdbe9d466e0d26852ca1ace0825220c8ca7d215 | |

Tailor \b{wb} for Perl

As Unicode defines it, \b{wb} matches the boundary between adjacent
space characters within a span of them. This is the opposite of what \b
does, and it runs counter to Perl expectations. This commit tailors
\b{wb} so that it does not split up spans of white space.

I have submitted a request to Unicode to re-examine their algorithm, and
it has been assigned to a subcommittee, but the result won't be
available until after 5.24 is done. In any event, Unicode encourages
tailoring for local conditions.

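The effect of the tailoring can be sketched as follows. This is a minimal illustration, assuming Perl 5.24 or later; the sample strings and variable names are hypothetical and do not come from the commit. It splits each string at every boundary and compares \b{wb} with plain \b.

```perl
use v5.24;    # the tailoring described above starts in Perl 5.24
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my $spaced = "foo   bar";     # three consecutive spaces
my $quoted = "can't stop";

# Splitting on a zero-width boundary yields the pieces between boundaries.
my @wb_spaced    = split /\b{wb}/, $spaced;
my @plain_spaced = split /\b/,     $spaced;
my @wb_quoted    = split /\b{wb}/, $quoted;
my @plain_quoted = split /\b/,     $quoted;

say join "|", @wb_spaced;       # the run of spaces stays in one piece
say join "|", @plain_spaced;    # same pieces as \b{wb} for this string
say join "|", @wb_quoted;       # the apostrophe stays inside the word
say join "|", @plain_quoted;    # plain \b breaks around the apostrophe
```

With the tailoring, both forms should treat the run of spaces as a single piece, while only \b{wb} keeps "can't" together.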
Diffstat (limited to 'pod/perlrebackslash.pod')
-rw-r--r-- | pod/perlrebackslash.pod | 19 |
1 files changed, 18 insertions, 1 deletions
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index e0e4661158..616aa447c8 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -592,12 +592,29 @@ future Perl versions.
 
 =item C<\b{wb}>
 
-This matches a Unicode "Word Boundary". This gives better (though not
+This matches a Unicode "Word Boundary", but tailored to Perl
+expectations. This gives better (though not
 perfect) results for natural language processing than plain C<\b>
 (without braces) does. For example, it understands that apostrophes can
 be in the middle of words and that parentheses aren't (see the examples
 below). More details are at L<http://www.unicode.org/reports/tr29/>.
 
+The current Unicode definition of a Word Boundary matches between every
+white space character. Perl tailors this, starting in version 5.24, to
+generally not break up spans of white space, just as plain C<\b> has
+always functioned. This allows C<\b{wb}> to be a drop-in replacement for
+C<\b>, but with generally better results for natural language
+processing. (The exception to this tailoring is when a span of white
+space is immediately followed by something like U+0303, COMBINING TILDE.
+If the final space character in the span is a horizontal white space, it
+is broken out so that it attaches instead to the combining character.
+To be precise, if a span of white space that ends in a horizontal space
+has the character immediately following it have either of the Word
+Boundary property values "Extend" or "Format", the boundary between the
+final horizontal space character and the rest of the span matches
+C<\b{wb}>. In all other cases the boundary between two white space
+characters matches C<\B{wb}>.)
+
 =back
 
 It is important to realize when you use these Unicode boundaries,