author | Karl Williamson <khw@cpan.org> | 2015-02-21 12:16:40 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-02-21 12:48:04 -0700 |
commit | d90f68193029ea3c44b13561f94dbc565e54b3f0 | |
tree | 5b242e6551cbb318782f197efb05cdc7f6b099bf /pod/perlrebackslash.pod | |
parent | 412a49a2e5f96f24a2a97864bb3af0f6023b3e7b | |
perlrebackslash: Amplify and correct \b{sb}, \b{wb}
Diffstat (limited to 'pod/perlrebackslash.pod')
-rw-r--r-- | pod/perlrebackslash.pod | 20 |
1 files changed, 15 insertions, 5 deletions
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 876d8741e7..425299d7e6 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -577,15 +577,24 @@ whichever is most convenient for your situation.
 This matches a Unicode "Sentence Boundary". This is an aid to parsing
 natural language sentences. It gives good, but imperfect results. For
 example, it thinks that "Mr. Smith" is two sentences. More details are
-at L<http://www.unicode.org/reports/tr29/>.
+at L<http://www.unicode.org/reports/tr29/>. Note also that it thinks
+that anything matching L</\R> (except form feed and vertical tab) is a
+sentence boundary. This works with word-processor text which line wraps
+automatically for display, but hard-coded line boundaries are considered
+to be essentially the ends of text blocks (paragraphs really), and hence
+the ends of sententces. It doesn't well with text containing embedded
+newlines, like the source text of the document you are reading. Such
+text needs to be preprocessed to get rid of the line separators before
+looking for sentence boundaries. Some people view this as a bug in the
+Unicode standard.
 
 =item C<\b{wb}>
 
 This matches a Unicode "Word Boundary". This gives better (though not
 perfect) results for natural language processing than plain C<\b>
 (without braces) does. For example, it understands that apostrophes can
-be in the middle of words. More details are at
-L<http://www.unicode.org/reports/tr29/>.
+be in the middle of words and that parentheses aren't. More details
+are at L<http://www.unicode.org/reports/tr29/>.
 
 =back
 
@@ -612,9 +621,10 @@ Mnemonic: I<b>oundary.
      print $1; # Prints 'cat'
  }
 
- print join "\n", "I don't care" =~ m/ ( .+? \b{wb} ) /xg;
+ print join "|", "He said, \"Do you care? (I don't).\""
+                 =~ m/ ( .+? \b{wb} ) /xg;
 prints
- I, ,don't, ,care
+ He| |said|,| |"|Do| |you| |care|?| |(|I| |don't|)|.|"
 
 =head2 Misc
 
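The added text says hard-wrapped input must have its line separators removed before looking for sentence boundaries. A minimal sketch of that preprocessing step follows, assuming Perl 5.22 or later and that the input is a single paragraph. The sample string and the variable names `$text`, `$unwrapped`, and `@sentences` are hypothetical and are not part of the commit. Replacing every `\R` with a space would also merge separate paragraphs, so real input would probably need to be split on blank lines first.

```perl
use strict;
use warnings;
use v5.22;    # \b{sb} and \b{wb} need Perl 5.22 or later

# Hypothetical sample: one paragraph, hard-wrapped in the middle of
# its first sentence.
my $text = "The quick brown fox jumped\nover the lazy dog. It then ran off.";

# Replace each line separator (anything \R matches) with a single space
# so that \b{sb} no longer treats the hard wrap as the end of a sentence.
(my $unwrapped = $text) =~ s/\R/ /g;

# Split at sentence boundaries, using the same pattern shape as the
# \b{wb} example in the patch.
my @sentences = $unwrapped =~ m/ ( .+? \b{sb} ) /xg;

print "[$_]\n" for @sentences;
```

Each captured piece keeps any trailing space that follows the sentence terminator, so joining the pieces gives back the unwrapped string.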