path: root/pod/perlunicode.pod
Commit message | Author | Age | Files | Lines
* Fix a bunch of repeated-word typos | Dagfinn Ilmari Mannsåker | 2020-05-22 | 1 | -3/+3
| | | | | Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* Add named sequences to Unicode wildcard name capabilities | Karl Williamson | 2020-03-20 | 1 | -10/+6
| | | | | | | | | Prior to this commit, specifying a named sequence would result in a mostly unhelpful fatal error message. This makes their use legal. This is also the beginning of allowing Unicode string properties, which are a new thing in the (still draft) Unicode requirements for regular expression parsing, UTS 18. Full compliance will have to come later.
* Implement \p{Name=/.../} wildcards | Karl Williamson | 2020-03-11 | 1 | -10/+38
| | | | | This commit adds wildcard subpatterns for the Name and Name Aliases properties.
* perlunicode: Fix typo | Karl Williamson | 2020-03-09 | 1 | -1/+1
| | | | Spotted by Hugo van der Sanden
* Restrict features in wildcards | Karl Williamson | 2020-02-19 | 1 | -3/+16
| | | | | | | | | | | | | | | | | | | | | | The algorithm for dealing with Unicode property wildcards is to wrap the user-supplied pattern with /miaa. We don't want the user to be able to override the /m and /aa parts. Modifiers that are only specifiable as a modifier in a qr or similar op (like /gc) can't be included in things like (?gc). These normally incur a warning that they are ignored, but the texts of those warnings are misleading when using wildcards, so I chose to just make them illegal. Of course that could be changed to having custom useful warning texts, but I didn't think it was worth it. I also chose to forbid recursion of using nested \p{}, just from fear that it might lead to issues down the road, and it really isn't useful for this limited universe of strings to match against. Because wildcards currently can't handle '}' inside them, only the single letter \p,\P are valid anyway. Similarly, I forbid the '*' quantifier to make it harder for the constructed subpattern to take forever to make any progress and decide to halt. Again, using it would be overkill on the universe of possible match strings.
* regcomp.c: Add wrappers for cmplng/xctng wildcard subpatterns | Karl Williamson | 2020-02-19 | 1 | -6/+2
| | | | | | | | This is in preparation for being called from more than one place. It has the salubrious effect that the wrapping we do around the user's supplied pattern is no longer visible in the Debug output of that pattern.
* Update perlunicode based on Unicode UTS 18, regex reqs | Karl Williamson | 2020-02-15 | 1 | -37/+17
| | | | | | | | | Unicode is revising their document on what regular expression implementations should do. This includes retraction of a significant part of it, which Perl did not handle (and apparently nobody else either). Thus we are much closer to implementing everything they say than before. The document is adding some new (manageable) things, which we do not yet support.
* Add qr/\p{Name=...}/ | Karl Williamson | 2020-02-12 | 1 | -9/+68
| | | | | | | | | | | | | | | This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names. This commit includes a comparison of the two approaches, added to perlunicode. But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit. I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.
* perlunicode: Slight clarification | Karl Williamson | 2020-02-05 | 1 | -1/+2
|
* PATCH GH #17025 \p{user-defined} overrides official Unicode | Karl Williamson | 2019-12-09 | 1 | -1/+3
| | | | Prior to this patch, they only sometimes overrode.
* Accept experimental script_run feature | Karl Williamson | 2019-10-31 | 1 | -1/+0
|
* Move more URLs from http:// to https:// | Max Maischein | 2019-10-11 | 1 | -18/+18
|
* Add Unicode property wildcards | Karl Williamson | 2019-03-12 | 1 | -2/+144
|
* perlunicode: Update, clarify | Karl Williamson | 2019-03-12 | 1 | -16/+24
| | | | | | | This updates to match the latest Unicode document on regular expressions, and to incorporate changes that have happened to Perl that didn't get updated here. It also includes new clarifications about some of the Unicode requirements.
* Move \p{user-defined} to core from utf8_heavy.pl | Karl Williamson | 2019-02-14 | 1 | -1/+2
| | | | | | | | | | | | | | This large commit moves the handling of user-defined properties to C code. This should speed it up, but the main reason to do this is to stop using swashes in this case, leaving only tr/// using them. Once that too is converted, all swash handling can be ripped out of perl. Doing this in perl has caused some nasty interactions that will now be fixed automatically. The change is not entirely transparent, however (besides speed and the possibility of removing these interactions). perldelta in this commit details these.
* perlunicode: Clarify user-defined props | Karl Williamson | 2018-08-05 | 1 | -1/+1
|
* Spelling corrections in pod/*.pod from Alexandr Savca. | Alexandr Savca | 2018-04-19 | 1 | -2/+2
| | | | | | | | | Alexandr Savca is now a Perl AUTHOR. For: RT #133120 Committer: holding off on the corrections to pod/perlartistic.pod until clarification of change to license text.
* pod: start referring to 5.6 and pre-5.6 as "ancient" instead of just "old" | Ævar Arnfjörð Bjarmason | 2017-12-06 | 1 | -1/+1
| | | | | | | 5.10 and 5.8 are old, 5.6 is ancient archaeology you're very unlikely to run into, but the casual reader may not know that, add the extra emphasis in case someone's mistaken about needing to worry about this for anything more than historical trivia.
* encoding.pm no longer works | Tony Cook | 2017-07-24 | 1 | -2/+3
|
* use utf8; doesn't force unicode semantics on all strings in scope | Tony Cook | 2017-07-24 | 1 | -1/+1
| | | | | | | | | e.g.
| | | | | | | | |   $ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
| | | | | | | | |   no match
| | | | | | | | | Perhaps this should be removed or completely reworded; it is worded similarly to the next point, which behaves differently.
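Note on the example above: a reader might expect a match because Unicode full case folding maps U+00DF (sharp s) to the two letters "ss". The sketch below is Python, not Perl, and is only an illustration of that folding rule; it makes no claim about how Perl's regex engine chooses between ASCII and Unicode rules. The variable names are illustrative.

```python
import re

sharp_s = chr(0xDF)  # LATIN SMALL LETTER SHARP S

# Full Unicode case folding maps this single character to two letters.
folded = sharp_s.casefold()

# Python's re module applies only simple (one-to-one) case mappings,
# so a case-insensitive search for "ss" does not find the sharp s.
simple_match = re.search("ss", sharp_s, re.IGNORECASE)

print(repr(folded), simple_match)
```

The contrast between the two results mirrors the kind of surprise the commit message describes: whether the characters compare equal depends on which case-folding rules are in effect.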
* RT #130907: Fix the Unicode Bug in split " " | Aaron Crane | 2017-07-15 | 1 | -0/+11
|
* Update pods about bitwise UTF-8 above 0xFF being fatal | Karl Williamson | 2017-06-07 | 1 | -5/+6
|
* perlunicode: Update text about malformed UTF-8 | Karl Williamson | 2017-04-11 | 1 | -11/+19
|
* perlunicode: Add link | Karl Williamson | 2017-04-09 | 1 | -1/+2
|
* PATCH: [perl #121292] wrong perlunicode BOM claims | Karl Williamson | 2017-04-08 | 1 | -6/+8
| | | | | A BOM at the beginning of a UTF-8 file is ignored, and doesn't otherwise do anything.
* pod: Suggest to use strict :encoding(UTF-8) PerlIO layer over not strict :encoding(utf8) | Pali | 2017-02-06 | 1 | -1/+1
| | | | | | For data exchange it is better to use strict UTF-8 encoding and not perl's utf8.
* pod: Suggest to use strict UTF-8 encoding when dealing with external data | Pali | 2017-01-26 | 1 | -4/+4
| | | | | For data exchange it is not a good idea to use perl's non-strict extended dialect of the utf8 encoding.
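Note on the two commits above: both turn on the difference between a strict UTF-8 codec, which rejects sequences the standard forbids, and a lax one, which lets them through. As a rough analogy only (this is Python, not Perl, and it does not model which sequences Perl's lax `utf8` accepts), Python's default UTF-8 codec is strict about lone surrogates, while its `surrogatepass` error handler plays the part of a lax encoder. The helper names are illustrative.

```python
lone_surrogate = "\ud800"  # a surrogate code point, not a valid scalar value


def encodes_strictly(text: str) -> bool:
    """Return True if the strict UTF-8 codec accepts the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_strictly(data: bytes) -> bool:
    """Return True if the strict UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


strict_ok = encodes_strictly(lone_surrogate)

# The lax handler emits bytes anyway ...
lax_bytes = lone_surrogate.encode("utf-8", errors="surrogatepass")

# ... which a strict reader on the other end will then refuse.
round_trip_ok = decodes_strictly(lax_bytes)

print(strict_ok, lax_bytes, round_trip_ok)
```

The last step is the data-exchange concern in miniature: output written leniently can be rejected by a strict consumer.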
* Fix the Unicode Bug in the range operator | Aaron Crane | 2017-01-05 | 1 | -0/+10
|
* perlunicode: Fix typo | Karl Williamson | 2016-09-17 | 1 | -1/+1
|
* Nested single quotes in documentation example | E. Choroba | 2016-09-16 | 1 | -1/+1
| | | | | | | Documentation of the Unicode Bug contains an example that nests single quotes in a shell. Most shells can't do that. Patch attached. Signed-off-by: Abigail <abigail@abigail.be>
* Change \p{foo} to mean \p{scx: foo} | Karl Williamson | 2016-06-30 | 1 | -12/+23
| | | | | | | when 'foo' is a script. Also update the pods correspondingly, and to encourage scx property use. See http://nntp.perl.org/group/perl.perl5.porters/237403
* perlunicode typo | Father Chrysostomos | 2016-06-26 | 1 | -1/+1
|
* Update perlunicode | Karl Williamson | 2016-06-26 | 1 | -78/+94
| | | | | | | | | | | | | | | | This fixes a couple of nits, but mostly it updates the text to correspond with changes in Unicode UTS#18, concerning regular expressions, and Perl compatibility with what it says. Note that though this Unicode document's text is written as if it were imposing requirements, it is not technically a part of the Unicode standard, so its "requirements" are merely suggestions or guidelines. It turns out that several of the "requirements" that Perl didn't meet have been retracted by Unicode (as effectively unimplementable), so the Perl Unicode support is actually better than it appeared, and in fact, is almost complete at the first 2 (of 3) levels of support discussed in UTS#18.
* perlunicode: Fix misstatement | Karl Williamson | 2016-06-26 | 1 | -1/+1
| | | | | v5.24 reinstated the ability to compile any earlier version of the Unicode standard into Perl, but this pod did not get updated.
* Add qr/\b{lb}/ | Karl Williamson | 2016-01-19 | 1 | -2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module. This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm. The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those. This should meet the needs for line breaking of many applications, without having to load the module. The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
* standardize on "lookahead" and "lookaround" | Ed Avis | 2015-12-07 | 1 | -2/+2
| | | | | | ...not the hyphenated form commit message by rjbs
* Deprecate Unicode code points above IV_MAX | Karl Williamson | 2015-11-28 | 1 | -1/+4
| | | | See https://rt.perl.org/Ticket/Display.html?id=115166
* Extend UTF-EBCDIC to handle up to 2**64-1 | Karl Williamson | 2015-11-25 | 1 | -4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word. The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way. This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5. This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both. To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to. I prematurely pushed that commit.
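Note on the I8 examples in the message above: the quoted byte strings can be reproduced with a short sketch. The code is Python rather than Perl and is not taken from the Perl source. It assumes that every continuation byte carries 5 payload bits in the range 0xA0..0xBF and that a 0xFF start byte introduces a 14-byte sequence, both of which follow from the examples in the message. The start-byte table for the shorter forms follows my understanding of the I8 layout in Unicode Technical Report 16 and should be treated as an assumption. The names `_I8_FORMS` and `i8_encode` are hypothetical. The result is I8 only; the final mapping from I8 to UTF-EBCDIC bytes depends on the code page and is not shown.

```python
# Each entry: (total bytes, start-byte marker, payload bits in the start byte).
# Shorter forms are assumed from the UTR 16 layout; the 14-byte form is the
# extension described in the commit message.
_I8_FORMS = [
    (2, 0xC0, 5),
    (3, 0xE0, 4),
    (4, 0xF0, 3),
    (5, 0xF8, 2),
    (6, 0xFC, 1),
    (7, 0xFE, 0),
    (14, 0xFF, 0),
]


def i8_encode(cp: int) -> bytes:
    """Return the I8 byte sequence for a code point below 2**64."""
    if not 0 <= cp < 1 << 64:
        raise ValueError("code point must fit in an unsigned 64-bit word")
    if cp < 0xA0:
        return bytes([cp])  # single-byte form
    for length, marker, start_bits in _I8_FORMS:
        n_cont = length - 1
        if cp < 1 << (start_bits + 5 * n_cont):
            # High bits go into the start byte, then 5 bits per continuation.
            out = [marker | (cp >> (5 * n_cont))]
            for i in range(n_cont - 1, -1, -1):
                out.append(0xA0 | ((cp >> (5 * i)) & 0x1F))
            return bytes(out)
    raise AssertionError("unreachable: the 14-byte form covers 65 bits")


for cp in (0x3FFFFFFF, 0x40000000, 2**64 - 1):
    print(f"U+{cp:X}", i8_encode(cp).hex(" ").upper())
```

Thirteen continuation bytes of 5 bits each give 65 payload bits, which is why the 14-byte form is enough for any 64-bit value.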
* pods: Discourage use of 'In' prefix for Unicode Block property | Karl Williamson | 2015-09-11 | 1 | -24/+28
| | | | | | | | | | | | | | | | | | This changes perluniprops to not list the equivalent 'In' single form method of specifying the Block property, and to discourage its use. The reason is that this is a Perl extension, the use of which is unstable. A future Unicode release could take over the 'In...' name for a new purpose, and perl would follow along, breaking the code that assumed the former meaning. Unicode does not know about this Perl extension, and they wouldn't care if they did know. The reason I'm doing this now is that the latest Unicode version introduced some properties whose names begin with 'In', though no conflicts arose. But it is clear that such conflicts could arise in the future. So the documentation only is changed to warn people of this potential. perlunicode is updated accordingly.
* PATCH: [perl #125947] Doc error in perlunicode | Karl Williamson | 2015-09-01 | 1 | -1/+1
|
* perlunicode: Fix small misstatement | Karl Williamson | 2015-05-08 | 1 | -3/+4
|
* perlunicode: Revamp | Karl Williamson | 2015-05-07 | 1 | -457/+558
| | | | | | | | | | | | | I've always had problems understanding the point of some of the discussion of this pod, so I've finally rewritten parts to bring it up-to-date with modern Unicode support and clarify things. In particular the "byte" vs "character" semantics didn't make sense to me. Perl has always used character semantics (outside of a few places noted in both pod versions); it's just that the advent of Unicode made 'byte' and 'character' no longer synonymous. So I've split that section of the old pod, with the added section entitled "ASCII rules vs Unicode rules", which I think is more clear.
* perlunicode: Nits, minor fixes | Karl Williamson | 2015-05-07 | 1 | -33/+41
|
* perlebcdic: Move text from perlunicode | Karl Williamson | 2015-05-07 | 1 | -13/+0
| | | | This consolidates the EBCDIC problems into one place
* perlunicode: Refer to perlguts for XS handling | Karl Williamson | 2015-05-07 | 1 | -105/+3
| | | | Don't redescribe things here. Also refer to perlapi.
* perlunicode: Update nonchars discussion for Unicode 7.0 | Karl Williamson | 2015-05-04 | 1 | -21/+69
| | | | | | | | | | | | | | | | | | Unicode 7.0 changed the prohibition of noncharacters to merely "not recommend" their use. Perl continues to forbid them in strict input checking (otherwise security issues could arise), but the discussion about them needs to be updated to correspond with their new status. The message raised when they are used probably should change correspondingly, but it is too late for 5.22 for that. This commit deletes some text elsewhere about the noncharacter code points. This text really wasn't germane to a discussion about UTF-8 (wherein it appeared), as the encoding is irrelevant to these code points. They're not recommended in any UTF format. Unicode spells the term "noncharacter" without a hyphen. This pod changes to follow that spelling.
* perlunicode: Nit, for EBCDIC | Karl Williamson | 2015-03-16 | 1 | -1/+1
|
* \s matching VT is no longer experimental | Karl Williamson | 2015-02-21 | 1 | -2/+2
| | | | | | | This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* Add qr/\b{wb}/ | Karl Williamson | 2015-02-19 | 1 | -1/+1
|
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -3/+5
| | | | | | | | | | A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup. This makes the implementation of \X go from being complicated to trivial.