path: root/pod/perlrebackslash.pod
Commit message | Author | Age | Files | Lines
* Allow blanks within and adjacent to {...} constructs | Karl Williamson | 2021-01-20 | 1 | -0/+16
  This was the consensus in http://nntp.perl.org/group/perl.perl5.porters/258489
* perlrebackslash: A few tweaks | Karl Williamson | 2021-01-20 | 1 | -7/+9
  Some white-space changes for vertical alignment, a new example, and a couple of clarifications.
* Change some link pod for better rendering | Karl Williamson | 2020-08-31 | 1 | -1/+1
  C<L</foo>> renders better in places than L</C<foo>>
* Unicode.org is https, except for http://cldr.unicode.org | Max Maischein | 2019-10-11 | 1 | -10/+10
* Supply missing right brace in regex example | James E Keenan | 2019-08-31 | 1 | -1/+1
  As suggested by Jim Avera in RT 134395.
* Allow qr'\N{...}' | Karl Williamson | 2019-03-13 | 1 | -1/+8
* Spelling corrections in pod/*.pod from Alexandr Savca. | Alexandr Savca | 2018-04-19 | 1 | -1/+1
  Alexandr Savca is now a Perl AUTHOR.
  For: RT #133120
  Committer: holding off on the corrections to pod/perlartistic.pod until clarification of change to license text.
* perlrebackslash: Fix a couple of nits. | Karl Williamson | 2017-06-21 | 1 | -3/+3
* perlrebackslash: Clarify | Karl Williamson | 2017-02-20 | 1 | -13/+13
  "Character class for non vertical whitespace." wasn't meant to mean match whitespace that isn't vertical.
* Prepare for Unicode 9.0 | Karl Williamson | 2016-06-21 | 1 | -1/+1
  The major code changes needed to support Unicode 9.0 are changes to the boundary (break) rules, for things like \b{lb} and \b{wb}.

  regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see if a given point in the target string is a break or not, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break, and whose column corresponds to the one after. Mostly this is completely determining, but for some cases, extra context is required, and the array entry indicates this, and there has to be specially crafted code in regexec.c to handle each such possibility.

  When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and regexec.c has to be changed to handle any changes to the custom code.

  Unfortunately this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along, which need new rules. Sometimes it is because they realized the previous version didn't work as well as it could. An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between each pair, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emojis. We can expect further changes to these in the next Unicode releases.

  \b{gcb}, for the first time, now depends on context (in rarely encountered cases, like RI's), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.

  Some years ago I revamped mktables in part to try to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included. Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is there may not have been a paradigm, just something they did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong.

  Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, like CJK UNIFIED IDEOGRAPH-3400 has always done. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with these automatically without manual intervention.
* Fix some pod errors | Karl Williamson | 2016-04-22 | 1 | -1/+1
  These were discovered while testing the Pod::Checker that is intended to be used in 5.25.
* Add qr/\b{lb}/ | Karl Williamson | 2016-01-19 | 1 | -6/+16
  This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module.

  This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm.

  The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those.

  This should meet the needs for line breaking of many applications, without having to load the module.

  The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
* Tailor \b{wb} for Perl | Karl Williamson | 2016-01-08 | 1 | -1/+18
  The Unicode \b{wb} matches the boundary between space characters in a span of them. This is opposite of what \b does, and is counterintuitive to Perl expectations. This commit tailors \b{wb} to not split up spans of white space.

  I have submitted a request to Unicode to re-examine their algorithm, and this has been assigned to a subcommittee to look at, but the result won't be available until after 5.24 is done. In any event, Unicode encourages tailoring for local conditions.
* perlrebackslash: White-space, clarification | Karl Williamson | 2016-01-08 | 1 | -2/+4
* remove deprecated /\C/ RE character class | David Mitchell | 2015-06-19 | 1 | -14/+0
  This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* perlrebackslash: Note \b{sb} is subject to change | Karl Williamson | 2015-05-07 | 1 | -1/+2
  The Unicode algorithm has big issues, and may change.
* perlrebackslash: Nit | Karl Williamson | 2015-03-29 | 1 | -1/+1
* perlrebackslash: Clarify that \b{} rules are volatile | Karl Williamson | 2015-03-18 | 1 | -11/+20
* perlrebackslash: Add, correct \b{} text | Karl Williamson | 2015-03-09 | 1 | -13/+33
  This fleshes out documentation about this new feature
* perlrebackslash: Nit | Karl Williamson | 2015-03-09 | 1 | -1/+1
* perlrebackslash: Amplify and correct \b{sb}, \b{wb} | Karl Williamson | 2015-02-21 | 1 | -5/+15
* Add \b{sb} | Karl Williamson | 2015-02-19 | 1 | -0/+7
* Add qr/\b{wb}/ | Karl Williamson | 2015-02-19 | 1 | -8/+29
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -8/+31
  A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.

  This makes the implementation of \X go from being complicated to trivial.
* perlrebackslash, perlreref.pod: White space only | Karl Williamson | 2014-03-26 | 1 | -1/+2
  Fit verbatim lines into 79 columns
* Deprecate /\C/ | David Mitchell | 2014-03-26 | 1 | -5/+5
  For 5.20, just say it's deprecated. We'll add a warning in 5.22 and change its behaviour in 5.24.
* perlrebackslash: Add clarifying note about \X | Karl Williamson | 2013-12-06 | 1 | -0/+3
* POD nitpicks. | SHIRAKATA Kentaro | 2013-04-23 | 1 | -8/+11
  Also, rebreak some verbatim lines to avoid porting error. Update known_pod_issues.dat.
* \N was still marked experimental in some places | David Mitchell | 2013-04-19 | 1 | -1/+1
* \N is no longer experimental | Karl Williamson | 2013-02-27 | 1 | -1/+1
* perlrebackslash: #109408 | Brian Fraser | 2012-06-27 | 1 | -1/+1
* Clarify some quotemeta docs | Karl Williamson | 2012-02-15 | 1 | -6/+10
* perrebackslash, perlrecharclass: Note locale effects | Karl Williamson | 2012-02-09 | 1 | -0/+3
  This adds text to specify what happens under 'use locale'.
* pod updates for fc and \F | Brian Fraser | 2012-01-29 | 1 | -1/+6
* Autoload charnames for \N{name} | Karl Williamson | 2011-12-20 | 1 | -3/+1
  This autoloads charnames.pm when needed. It uses the :full and :short options. :loose is not used because of its relative unfamiliarity in the Perl community, and is slower. (If someone later added a typical "use charnames qw(:full)", things that previously matched under :loose would start to fail, causing confusion. If :loose does become more common, we can change this in the future to use it; the converse isn't true.)

  The callable functions in the module are not automatically loaded. To access them, an explicit "use charnames" must be provided.

  Thanks to Tony Cook for doing a code inspection and finding a missing SPAGAIN.
* perlrebackslash: too grammer tweaks | Father Chrysostomos | 2011-10-27 | 1 | -2/+2
* PATCH: [perl #99928] Document that is not a bug | Karl Williamson | 2011-10-27 | 1 | -1/+8
  After consulting with Tom Christiansen, we decided that this is not a bug. The CR-LF sequence is considered a unit by Unicode, and so should be inseparable, even when separating the two would cause a pattern to match that otherwise fails.
* perlrebackslash: Add missing paren to example | Karl Williamson | 2011-09-25 | 1 | -1/+1
* perlrebackslash: Slight edits | Karl Williamson | 2011-04-15 | 1 | -6/+7
* perlrebackslash: Update for 5.14 changes | Karl Williamson | 2011-04-12 | 1 | -4/+4
* multifile patch against blead/pod/*.pod | Tom Christiansen | 2011-02-15 | 1 | -67/+78
  I mostly fixed spelling mistakes, some of very long standing, but a few files got more attentive word-smithying. I've updated:
    pod/perl.pod
    pod/perldelta.pod
    pod/perl592delta.pod
    pod/perl5120delta.pod
    pod/perl51310delta.pod
    pod/perl5139delta.pod
    pod/perlfunc.pod
    pod/perlop.pod
    pod/perlrebackslash.pod
    pod/perlrecharclass.pod
    pod/perlutil.pod
    pod/perlhack.pod
    pod/perlintern.pod
    pod/perlnetware.pod
    pod/perlpolicy.pod
* Add /a regex modifier | Karl Williamson | 2011-01-17 | 1 | -0/+4
  This restricts certain constructs, like \w, to matching in the ASCII range only.
* Fix typos in pod/* | (Peter J. Acklam) (via RT) | 2011-01-07 | 1 | -1/+1
  # New Ticket Created by (Peter J. Acklam)
  # Please include the string: [perl #81906]
  # in the subject line of all future correspondence about this issue.
  # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81906 >
* Nits in re pods | Karl Williamson | 2010-10-31 | 1 | -2/+3
* DOCs: Clarify that \w matches marks and \Pc | Karl Williamson | 2010-10-31 | 1 | -3/+4
  The previous documentation really didn't specify what \w is. It matches the underscore, but also all other connector punctuation, plus any marks, such as diacritical accents that occur within a word.
* perlrebackslash: Fix poor grammar | Karl Williamson | 2010-10-15 | 1 | -2/+2
* Teach Perl about Unicode named character sequences | Karl Williamson | 2010-09-25 | 1 | -11/+13
  mktables is changed to process the Unicode named sequence file. charnames.pm is changed to cache the looked-up values in utf8. A new function, string_vianame, is created that can handle named sequences, as the interface for vianame cannot. The subroutine lookup_name() is slightly refactored to do almost all of the common work for \N{} and the vianame routines. It now understands named sequences as created by mktables.

  Tests and documentation are added. In the randomized testing section, half use vianame() and half string_vianame().
* Fix casing, wording | Karl Williamson | 2010-09-25 | 1 | -2/+2
* Add \o{} escape | Karl Williamson | 2010-07-17 | 1 | -26/+57
  This commit adds the new construct \o{} to express a character constant by its octal ordinal value, along with ancillary tests and documentation.

  A function to handle this is added to util.c, and it is called from the 3 parsing places it could occur. The function is a candidate for in-lining, though I doubt that it will ever be used frequently.
* perlrebackslash: Nits | Karl Williamson | 2010-07-17 | 1 | -2/+2
  Signed-off-by: David Golden <dagolden@cpan.org>