| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Mostly in comments and docs, but some in diagnostic messages and one
case of 'or die die'.
| |
Prior to this commit, specifying a named sequence would result in a
mostly unhelpful fatal error message. This makes their use legal.
This is also the beginning of allowing Unicode string properties, which
are a new thing in the (still draft) Unicode requirements for regular
expression parsing, UTS 18. Full compliance will have to come later.
| |
This commit adds wildcard subpatterns for the Name and Name Aliases
properties.
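As a rough illustration of what a wildcard on the Name property looks like, here is a minimal sketch. It assumes Perl 5.32 or later, where the feature is still experimental; the subpattern and the sample characters are chosen only for this example.

```perl
use v5.32;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# Match any character whose Unicode name contains "GRINNING FACE".
# '!' is the qr delimiter so the '/' around the subpattern needs no escaping.
my $grinning = qr!\p{name=/GRINNING FACE/}!;

my $face_matches   = "\x{1F600}" =~ $grinning ? 1 : 0;   # U+1F600 GRINNING FACE
my $letter_matches = "A"         =~ $grinning ? 1 : 0;   # LATIN CAPITAL LETTER A

say "U+1F600: ", $face_matches   ? "match" : "no match";
say "A:       ", $letter_matches ? "match" : "no match";
```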
| |
Spotted by Hugo van der Sanden
| |
The algorithm for dealing with Unicode property wildcards is to wrap the
user-supplied pattern with /miaa. We don't want the user to be able to
override the /m and /aa parts. Modifiers that are only specifiable as a
modifier in a qr or similar op (like /gc) can't be included in things
like (?gc). These normally incur a warning that they are ignored, but
the texts of those warnings are misleading when using wildcards, so I
chose to just make them illegal. Of course that could be changed to
having custom useful warning texts, but I didn't think it was worth it.
I also chose to forbid recursion of using nested \p{}, just from fear
that it might lead to issues down the road, and it really isn't useful
for this limited universe of strings to match against. Because
wildcards currently can't handle '}' inside them, only the single letter
\p,\P are valid anyway.
Similarly, I forbid the '*' quantifier to make it harder for the
constructed subpattern to take forever to make any progress and decide
to halt. Again, using it would be overkill on the universe of possible
match strings.
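For orientation, a property wildcard that stays within these restrictions might look like the following sketch. It assumes Perl 5.30 or later with the experimental warning silenced; the Numeric_Value subpattern and the sample code points are illustrative choices, not taken from the commit.

```perl
use v5.30;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# Characters whose Numeric_Value is exactly 0 through 5. Perl adds its own
# wrapping, so only the part between the slashes is user-supplied.
my $zero_to_five = qr!\p{nv=/\A[0-5]\z/}!;

my %matched;
# DIGIT THREE, DIGIT SEVEN, ARABIC-INDIC DIGIT THREE
for my $code_point (0x33, 0x37, 0x0663) {
    $matched{$code_point} = chr($code_point) =~ $zero_to_five ? 1 : 0;
    printf "U+%04X: %s\n", $code_point,
        $matched{$code_point} ? "match" : "no match";
}
```

The subpattern uses no '*' quantifier, no nested \p{}, and no op-level modifiers, so it avoids everything this commit makes illegal.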
| |
This is in preparation for being called from more than one place.
It has the salubrious effect that the wrapping we do around the user's
supplied pattern is no longer visible in the Debug output of that
pattern.
| |
Unicode is revising their document on what regular expression
implementations should do. This includes retraction of a significant
part of it, which Perl did not handle (nor, apparently, did anyone
else). Thus we are much closer to implementing everything they say
than before. The document is adding some new (manageable) things, which
we do not yet support.
| |
This accomplishes the same thing as \N{...}, but only for regex
patterns, using loose matching and only the official Unicode names.
This commit includes a comparison of the two approaches, added to
perlunicode. But the real reason to do this is as a way station to
being able to specify wildcard lookup on the name property, coming in a
later commit.
I chose not to include user-defined aliases or :short character names
at this time. I thought that there might be unforeseen consequences of
using them. It's better to later relax a requirement than to try to
restrict it.
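A small sketch of the two approaches side by side, assuming Perl 5.32 or later (the letter chosen is arbitrary):

```perl
use v5.32;
use warnings;

# \N{...} names a character and works in strings as well as patterns.
my $via_N = "a" =~ /\N{LATIN SMALL LETTER A}/ ? 1 : 0;

# \p{name=...} is regex-only and matches the official name loosely,
# so differences in case are ignored.
my $via_p       = "a" =~ /\p{name=LATIN SMALL LETTER A}/ ? 1 : 0;
my $via_p_loose = "a" =~ /\p{name=latin small letter a}/ ? 1 : 0;

# A different letter does not satisfy the property.
my $other = "b" =~ /\p{name=LATIN SMALL LETTER A}/ ? 1 : 0;

say "N: $via_N, p: $via_p, loose p: $via_p_loose, other letter: $other";
```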
| |
Prior to this patch, they only sometimes overrode.
| |
This updates to match the latest Unicode document on regular
expressions, and to incorporate changes that have happened to Perl that
didn't get updated here. It also includes new clarifications about some
of the Unicode requirements.
| |
This large commit moves the handling of user-defined properties to C
code. This should speed it up, but the main reason to do this is to
stop using swashes in this case, leaving only tr/// using them. Once
that too is converted, all swash handling can be ripped out of perl.
Doing this in perl has caused some nasty interactions that will now be
fixed automatically.
The change is not entirely transparent, however (besides speed and the
possibility of removing these interactions). perldelta in this commit
details these.
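For readers unfamiliar with the feature being moved, a user-defined property is just a sub that returns code point ranges. The sketch below uses a hypothetical property name, IsKana, and illustrative ranges; it shows only the user-facing side, not the new C implementation.

```perl
use strict;
use warnings;

# A user-defined property is a sub whose name begins with "In" or "Is"
# and that returns ranges of hexadecimal code points, one range per line.
sub IsKana {
    return join "", "3040\t309F\n",    # Hiragana block
                    "30A0\t30FF\n";    # Katakana block
}

my $hiragana_a_inside = "\x{3042}" =~ /\p{IsKana}/ ? 1 : 0;  # HIRAGANA LETTER A
my $latin_a_outside   = "A"        =~ /\P{IsKana}/ ? 1 : 0;  # not in either range

print "HIRAGANA LETTER A inside: $hiragana_a_inside\n";
print "LATIN A outside: $latin_a_outside\n";
```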
| |
Alexandr Savca is now a Perl AUTHOR.
For: RT #133120
Committer: holding off on the corrections to pod/perlartistic.pod until
clarification of change to license text.
| |
5.10 and 5.8 are old, and 5.6 is ancient archaeology you're very
unlikely to run into, but the casual reader may not know that. Add
extra emphasis in case someone mistakenly thinks this needs worrying
about as anything more than historical trivia.
| |
e.g.
$ perl -Mutf8 -le 'chr(0xdf) =~ /ss/i and print "match" or print "no match"'
no match
Perhaps this should be removed or completely reworded; it is worded
similarly to the next point, which behaves differently.
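For contrast, a minimal sketch of the same match with the rule set made explicit through the /d and /u modifiers (the variable names are only for this example):

```perl
use strict;
use warnings;

my $sharp_s = chr(0xDF);   # LATIN SMALL LETTER SHARP S, stored as one byte

# /d: native rules apply to this non-UTF-8 string, as in the one-liner.
my $native = $sharp_s =~ /ss/id ? "match" : "no match";

# /u: Unicode rules, under which sharp s case-folds to "ss".
my $unicode = $sharp_s =~ /ss/iu ? "match" : "no match";

print "native rules:  $native\n";
print "Unicode rules: $unicode\n";
```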
| |
A BOM at the beginning of a UTF-8 file is ignored, and doesn't otherwise
do anything.
| |
:encoding(utf8)
For data exchange it is better to use strict UTF-8 encoding and not perl's utf8.
| |
For data exchange it is not a good idea to use Perl's non-strict,
extended dialect of the utf8 encoding.
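The difference can be seen with the core Encode module. This is a sketch; the byte sequence is an illustrative input that would encode 0x110000, one past the last Unicode code point.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings name different encodings in Encode.
my $strict_name = find_encoding('UTF-8')->name;   # strict, standard UTF-8
my $lax_name    = find_encoding('utf8')->name;    # Perl's extended dialect

my $bytes = "\xF4\x90\x80\x80";   # would encode 0x110000

# Pass copies ("$bytes") so neither call can alter the original buffer.
my $lax    = decode('utf8',  "$bytes");   # accepted by the lax dialect
my $strict = decode('UTF-8', "$bytes");   # malformed for strict UTF-8

print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";
printf "lax decodes to U+%X\n", ord $lax;
printf "strict result contains U+FFFD: %s\n",
    $strict =~ /\x{FFFD}/ ? "yes" : "no";
```

The same distinction applies to I/O layers: `:encoding(UTF-8)` is the strict one and `:encoding(utf8)` the lax one.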
| |
Documentation of the Unicode Bug contains an example that nests single
quotes in a shell. Most shells can't do that. Patch attached.
Signed-off-by: Abigail <abigail@abigail.be>
| |
when 'foo' is a script. Also update the pods correspondingly, and to
encourage scx property use.
See http://nntp.perl.org/group/perl.perl5.porters/237403
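A sketch of why the scx property is the one to encourage. It assumes Perl 5.26 or later and assumes that the change described here is the one making a bare script name behave like Script_Extensions; the sample character is an illustrative choice.

```perl
use v5.26;
use warnings;

# ARABIC TATWEEL: its Script is Common, but it is used with Arabic
# (and several other scripts), which Script_Extensions records.
my $tatweel = "\x{0640}";

my $by_script     = $tatweel =~ /\p{sc=Arabic}/  ? 1 : 0;   # Script
my $by_extensions = $tatweel =~ /\p{scx=Arabic}/ ? 1 : 0;   # Script_Extensions
my $by_bare_name  = $tatweel =~ /\p{Arabic}/     ? 1 : 0;   # bare script name

say "sc=Arabic: $by_script, scx=Arabic: $by_extensions, Arabic: $by_bare_name";
```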
| |
This fixes a couple of nits, but mostly it updates the text to
correspond with changes in Unicode UTS#18, concerning regular
expressions, and Perl compatibility with what it says.
Note that though this Unicode document's text is written as if it were
imposing requirements, it is not technically a part of the Unicode
standard, so its "requirements" are merely suggestions or guidelines.
It turns out that several of the "requirements" that Perl didn't meet
have been retracted by Unicode (as effectively unimplementable), so the
Perl Unicode support is actually better than it appeared, and in fact,
is almost complete at the first 2 (of 3) levels of support discussed in
UTS#18.
| |
v5.24 reinstated the ability to compile any earlier version of the
Unicode standard into Perl, but this pod did not get updated.
| |
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
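A minimal usage sketch, assuming the boundary is spelled \b{lb} and Perl 5.24 or later; the sample text is arbitrary and no tailoring is involved.

```perl
use v5.24;
use warnings;

my $text = "The quick brown fox";

# Split at the zero-width line break opportunities. Each piece keeps its
# trailing space because a break is allowed after a space, not before it.
my @pieces = split /\b{lb}/, $text;

say join "|", @pieces;
```

A real line wrapper would then pack these pieces greedily into lines of the desired width.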
| |
...not the hyphenated form
commit message by rjbs
| |
See https://rt.perl.org/Ticket/Display.html?id=115166
| |
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64 bit word.
The downside of this is that this extension is not compatible with
previous perls for the range 2**30 up through the previous max,
2**31 - 1. A simple program could be written to convert files that were
written out using an older perl so that they can be read with newer
perls, and the perldelta says we will do this should anyone ask.
However, I strongly suspect that the number of such files in existence
is zero, as people in EBCDIC land don't seem to use Unicode much, and
these are very large code points, which are associated with a
portability warning every time they are output in some way.
This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.
This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.
To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000 is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events. This now is the commit it was referring to.
commit I prematurely
pushed that
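The ASCII-platform mechanism that this commit mirrors can be observed directly. The sketch below assumes an ASCII platform and a perl built with 64-bit integers; the helper name internal_bytes is hypothetical.

```perl
use strict;
use warnings;
no warnings qw(non_unicode portable);
use Config;

die "this sketch needs 64-bit integers\n" unless $Config{uvsize} >= 8;

# Return the bytes Perl uses internally to encode one code point.
sub internal_bytes {
    my ($code_point) = @_;
    my $string = chr $code_point;
    utf8::encode($string);    # expose the internal encoding as bytes
    return $string;
}

my $small = internal_bytes(0x7FFF_FFFF);      # 2**31 - 1
my $large = internal_bytes(0x10_0000_0000);   # 2**36

printf "U+7FFFFFFF:   %2d bytes, start byte 0x%02X\n",
    length $small, ord $small;
printf "U+1000000000: %2d bytes, start byte 0x%02X\n",
    length $large, ord $large;
```

On such a platform the second code point needs the 0xFF start byte and a much longer sequence than the first, which is the behaviour the commit brings to UTF-EBCDIC with its own byte counts.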
| |
This changes perluniprops to not list the equivalent 'In' single form
method of specifying the Block property, and to discourage its use. The
reason is that this is a Perl extension, the use of which is unstable.
A future Unicode release could take over the 'In...' name for a new
purpose, and perl would follow along, breaking the code that assumed the
former meaning. Unicode does not know about this Perl extension, and
they wouldn't care if they did know.
The reason I'm doing this now is that the latest Unicode version
introduced some properties whose names begin with 'In', though no
conflicts arose. But it is clear that such conflicts could arise in the
future. So the documentation only is changed to warn people of this
potential.
perlunicode is updated accordingly.
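A short sketch of the explicit compound form that avoids the problem (the block and character are arbitrary examples):

```perl
use strict;
use warnings;

my $zhe = "\x{0416}";   # CYRILLIC CAPITAL LETTER ZHE

# Compound form: names the Block property explicitly, so a future Unicode
# property whose name begins with "In" cannot change its meaning.
my $in_block = $zhe =~ /\p{Block=Cyrillic}/ ? 1 : 0;
my $in_blk   = $zhe =~ /\p{Blk=Cyrillic}/   ? 1 : 0;   # short property name

# The single form \p{InCyrillic} is the Perl extension being discouraged.
print "Block=Cyrillic: $in_block, Blk=Cyrillic: $in_blk\n";
```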
| |
I've always had problems understanding the point of some of the
discussion of this pod, so I've finally rewritten parts to bring it
up-to-date with modern Unicode support and clarify things.
In particular the "byte" vs "character" semantics didn't make sense to
me. Perl has always used character semantics (outside of a few places
noted in both pod versions); it's just that the advent of Unicode made
'byte' and 'character' no longer synonymous. So I've split that section
of the old pod, with the added section entitled "ASCII rules vs Unicode
rules", which I think is more clear.
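The distinction can be sketched with explicit regex modifiers (the character is an arbitrary example from the upper Latin-1 range):

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # LATIN SMALL LETTER E WITH ACUTE

my $word_ascii   = $e_acute =~ /\w/a ? 1 : 0;   # ASCII rules: not a word char
my $word_unicode = $e_acute =~ /\w/u ? 1 : 0;   # Unicode rules: a word char

# Either way the string is one character long; only the rules differ.
my $length = length $e_acute;

print "ASCII rules: $word_ascii, Unicode rules: $word_unicode, ",
      "length: $length\n";
```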
| |
This consolidates the EBCDIC problems into one place
| |
Don't redescribe things here. Also refer to perlapi.
| |
Unicode 7.0 changed the prohibition of noncharacters to merely "not
recommend" their use. Perl continues to forbid them in strict input
checking (otherwise security issues could arise), but the discussion
about them needs to be updated to correspond with their new status.
The message raised when they are used probably should change
correspondingly, but it is too late for 5.22 for that.
This commit deletes some text elsewhere about the noncharacter code
points. This text really wasn't germane to a discussion about UTF-8
(wherein it appeared), as the encoding is irrelevant to these code
points. They're not recommended in any UTF format.
Unicode spells the term "noncharacter" without a hyphen. This pod
changes to follow that spelling.
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
| |
A function determines whether the space between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
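From the user's side, the relationship between \X and the grapheme cluster boundary can be sketched as follows, assuming Perl 5.22 or later (the sample string is arbitrary):

```perl
use v5.22;
use warnings;

# "e" + COMBINING ACUTE ACCENT, then CR LF, then "a": five code points.
my $text = "e\x{301}\r\na";

my @clusters = $text =~ /\X/g;           # one element per grapheme cluster
my @pieces   = split /\b{gcb}/, $text;   # same boundaries via \b{gcb}

say scalar(@clusters), " clusters from ", length($text), " code points";
```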
|