| Commit message (Collapse) | Author | Age | Files | Lines |
| |
...not the hyphenated form
commit message by rjbs
| |
This mostly adds C<> formatting, but there are a few updates,
clarifications, and grammar-type fixes.
| |
This adds /n to various places in perlre where it was omitted, and adds
a heading to better structure the document, and a clarifying sentence.
| |
In perl #126186 it was pointed out we had started allowing name
arguments for verbs where we did not document them to be supported,
albeit in an inconsistent way. The previous patch cleaned up some
of the cause of this, but it seems better to just generally allow
the existing verbs to all support a mark name argument.
So this patch reverses the effect of the previous patch, and makes
all verbs, FAIL, ACCEPT, etc, allow an optional argument, and
set REGERROR/REGMARK appropriately as well.
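A minimal sketch of how a verb argument can be observed from Perl code, assuming a perl that includes this change. The names "after_a" and "rejected" are illustrative only, and the exact value left in $REGERROR is printed rather than assumed:

```perl
use strict;
use warnings;

# $REGMARK and $REGERROR are plain package variables, so declare them
# to keep "use strict" happy.
our ($REGMARK, $REGERROR);

# On a successful match, $REGMARK holds the name of the last mark passed.
my $mark;
if ("ab" =~ /a(*MARK:after_a)b/) {
    $mark = $REGMARK;
    print "matched, REGMARK=$mark\n";
}

# Compile the (*FAIL:ARG) form from a string inside eval, because a perl
# without this change rejects the argument when the pattern is compiled.
my $source  = 'a(*FAIL:rejected)';   # "rejected" is an illustrative name
my $fail_re = eval { qr/$source/ };
if (!defined $fail_re) {
    print "this perl does not accept (*FAIL:ARG)\n";
}
elsif ("abc" !~ $fail_re) {
    print "match failed, REGERROR=",
        (defined $REGERROR ? $REGERROR : 'undef'), "\n";
}
```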
| |
This horrible thing broke encapsulation and was as buggy as a very buggy
thing. It's been officially deprecated since 5.20.0 and now it can finally
die die die!!!!
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
| |
A function implements seeing if the space between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
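For illustration only, this is the user-visible behaviour of \X that the change implements, not the C function itself. A base letter followed by a combining mark counts as one cluster:

```perl
use strict;
use warnings;

# "e" + COMBINING ACUTE ACCENT forms one grapheme cluster; "a" is another.
my $text     = "e\x{301}a";
my @clusters = $text =~ /\X/g;
my $count    = @clusters;

print "code points: ", length($text), ", clusters: $count\n";
```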
| |
When a range in a bracketed character class has one end be specified as
Unicode, the whole range is viewed as Unicode. Currently this is not
warned about, though it is somewhat like mixing apples and oranges.
This commit adds a warning, but only under "use re 'strict'", and
it now documents the behavior when only one end is specified as Unicode.
| |
This subpragma is to allow p5p to add warnings/errors for regex patterns
without having to worry about backwards compatibility. And it allows
users who want to have the latest checks on their code to do so. An
experimental warning is raised by default when it is used, not because
the subpragma might go away, but because what it catches is subject to
change from release-to-release, and so the user is acknowledging that
they waive the right to backwards compatibility. I will be working in
the near term to make some changes to what is detected by this.
Note that there is no indication in the pattern stringification that it
was compiled under this. This means I didn't have to figure out how to
stringify it. It is fine because using this doesn't affect what the
pattern gets compiled into, if successful. And interpolating the
stringified pattern under either strict or non-strict should both just
work.
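A small sketch of the stringification point, assuming a perl that has the 'strict' subpragma and the experimental::re_strict warning category. The pattern ab+c is just an example that is valid under both modes:

```perl
use strict;
use warnings;

my $plain = qr/ab+c/;

my $strict;
{
    # Silence the experimental warning before enabling the subpragma.
    no warnings 'experimental::re_strict';
    use re 'strict';
    $strict = qr/ab+c/;
}

# The two stringified forms can be compared directly.
print "plain:  $plain\n";
print "strict: $strict\n";
```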
| |
This makes [\N{U+06}-\N{U+09}] match U+06, U+07, U+08, U+09 even on
EBCDIC platforms, allowing one to write portable ranges. For 1047
EBCDIC this would match 0x2E, 0x2F, 0x16, and 0x05.
Thanks to Yaroslave Kuzmin for finding a bug in an earlier incarnation
of this patch.
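As a quick check of the portable range, a sketch that lists which native ordinals below 16 the class accepts. The result noted in the comment assumes an ASCII platform:

```perl
use strict;
use warnings;

# Both ends are given as Unicode code points, so the range is U+06..U+09.
my $range = qr/[\N{U+06}-\N{U+09}]/;

# On an ASCII platform the native ordinals equal the code points.
my @hits = grep { chr($_) =~ $range } 0 .. 15;
print "matching ordinals: @hits\n";
```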
| |
This commit also causes escaped (by a backslash) "(", "[", and "{" to be
considered literally. In the previous 2 Perl versions, the escaping was
ignored, and a (default-on) deprecation warning was raised. Now that we
have warned for 2 release cycles, we can change the meaning of escaping
to actually do something.
Warning when a literal left brace is not escaped by a backslash will
allow us to eventually use this character in more contexts as being
meta, allowing us to extend the language. For example, the lower limit
of a quantifier could be omitted, and better error checking instituted,
or things like \w could be followed by a {...} indicating some special
word character, like \w{Greek} to restrict to just Greek word
characters.
We tried to do this in v5.16, and many CPAN modules changed to backslash
their left braces at that time. However we had to back out that change
before 5.16 shipped because it turned out that escaping a left brace in
some contexts didn't work, namely when the brace would normally be a
metacharacter (for example surrounding a quantifier), and the pattern
delimiters were { }. Instead we raised the useless backslash warning
mentioned above, which has now been there for the requisite 2 cycles.
This patch partially reverts 2 patches. The first,
e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted
the deprecation of unescaped literal left brace. The other,
4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of
the useless left-characters.
Note that, as in the original attempt to deprecate, we don't raise a
warning if the left brace is the first character in the pattern. This
is because in that position it can't be a metacharacter, so we don't
require any disambiguation, and we found that if we did raise an error,
there were quite a few places where this occurred.
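To make the distinction concrete, a short sketch using slash delimiters, where the escaping is unambiguous. An unescaped {3} is a quantifier; the escaped form matches the brace characters themselves:

```perl
use strict;
use warnings;

my $quantified = "xaaa" =~ /a{3}/   ? 1 : 0;   # {3} quantifies "a"
my $literal    = "a{3}" =~ /a\{3\}/ ? 1 : 0;   # \{ and \} are literal
my $not_three  = "aaa"  =~ /a\{3\}/ ? 1 : 0;   # no literal brace in target

print "quantified=$quantified literal=$literal not_three=$not_three\n";
```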
| |
This brings Perl regular expressions more into conformance with Unicode.
/x now accepts 5 additional characters as white space. Use of these
characters as literals under /x has been deprecated since 5.18, so now
we are free to change what they mean.
This commit eliminates the static function that processes the old
whitespace definition (and a generated macro that was used only for
this), using the already existing one for the new definition. It
refactors slightly the static function that skips comments to mesh
better with the needs of its callers, and calls it in one place where
before the code was essentially duplicated.
p5p discussion starting in
http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that
the (?[ ]) comments should be terminated the same way as regular /x
comments, and this was also done in this commit. No prior notice is
necessary as this is an experimental feature.
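A sketch of the effect on /x, assuming a perl that includes this change. U+2028 LINE SEPARATOR is one of the newly recognized white-space characters; it is put into the pattern through an interpolated variable so that it appears as a literal character rather than as an escape:

```perl
use strict;
use warnings;

my $ls      = "\x{2028}";     # LINE SEPARATOR
my $pattern = "a${ls}b";      # the separator sits literally in the pattern

# Under /x the separator is skipped as white space; without /x it must match.
my $with_x    = "ab" =~ /$pattern/x ? 1 : 0;
my $without_x = "ab" =~ /$pattern/  ? 1 : 0;

print "with /x: $with_x, without /x: $without_x\n";
```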
| |
/foo{4,3}/ now emits a message, contrary to what the pod claims. Use a
different example that doesn't emit a message
| |
For 5.20, just say it's deprecated. We'll add a warning in 5.22
and change its behaviour in 5.24.
| |
subpattern references
Match variables should be dynamically scoped during GOSUB and GOSTART.
The caller's state should be inherited by the callee, but once the callee
returns, the caller's state should be restored.
This is different from EVAL, where the caller's and callee's states are
expected to not be the same (although they might be the same), and where
the "reasonable" match semantics differ. Currently the following two
one liners will produce different results:
$ ./perl -Ilib -le'"<ab><>>" =~/ < (?: \1 | [ab]+ ) (>) (?0)? /x and print $&;'
<ab><>>
$ ./perl -Ilib -le'$qr= qr/ < (?: \1 | [ab]+ ) (>) (??{ $qr })? /x; "<ab><>>" =~ m/$qr/ and print $&;'
<ab>
While I think reasonable people could argue that we should special case
things when we know that the return from (??{ ... }) is the same as the
currently executing pattern, I think explaining the difference would be
harder than necessary.
Conversely, making GOSUB/GOSTART exactly the same as EVAL, so that
the match vars were totally independent seems to throw away an
opportunity for much more powerful semantics than can be offered by
EVAL.
| |
The term 'semantics' in documentation when applied to character sets is
changed to 'rules' as being a shorter less-jargony synonym in this case.
This was discussed several releases ago, but I didn't get around to it.
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case
insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
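The MICRO SIGN edge case can be seen from Perl code under plain Unicode rules, with no locale involved. This sketch only illustrates the fold relationships described above, not the locale handling added by the commit:

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);

my $micro    = "\x{B5}";     # MICRO SIGN, in the 0-255 range
my $cap_mu   = "\x{39C}";    # GREEK CAPITAL LETTER MU
my $small_mu = "\x{3BC}";    # GREEK SMALL LETTER MU

# Both characters fold to the small mu, so they match each other under /i.
my $same_fold = (fc($micro) eq $small_mu && fc($cap_mu) eq $small_mu) ? 1 : 0;
my $matches   = $micro =~ /\A\x{39C}\z/i ? 1 : 0;

print "same fold: $same_fold, MICRO SIGN matches CAP MU: $matches\n";
```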
| |
This pod omitted some of the details about how comments in regexes work.
I had to do some experimentation to find some of the answers.
I believe this clarifies things as well.
| |
This patch was suggested in #116773.
| |
Per bug report filed by Jacinta Richardson++.
For: RT #119151
| |
On something like:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
each iteration could in theory start with pos one character to the left
of the previous position, and with the substitution replacing bits that
it has already replaced. Since that way madness lies, ban any attempt by
s/// to substitute to the left of a previous position.
To implement this, add a new flag to regexec(), REXEC_FAIL_ON_UNDERFLOW.
This tells regexec() to return failure even if the match itself succeeded,
but where the start of $& is before the passed stringarg point.
This change caused one existing test to fail (which was added about a year
ago):
$_="abcdef";
s/bc|(.)\G(.)/$1 ? "[$1-$2]" : "XX"/ge;
print; # used to print "aXX[c-d][d-e][e-f]"; now prints "aXXdef"
I think that that test relies on ambiguous behaviour, and that my change
makes things saner.
Note that s/// with \G is generally very under-tested.
| |
...but we need some more explanation of its limitations. This text
was provided by Yves Orton on perl5-porters in message
<CANgJU+UXO7tKZgOvbwufFxAjupOcKVPdDBNkRrT7DWKdv9tBgw@mail.gmail.com>
| |
When I added support for possessive modifiers it was possible to
build perl so that they could be combined even if it made no sense
to do so.
Later on in relation to Perl #118375 I got confused and thought
they were undocumented but legal.
So to prevent further confusion, and since nobody has ever mentioned
it since they were added, I am removing the unused conditional
build logic, and clearly documenting why they aren't allowed.
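For reference, a sketch of what a possessive quantifier does and of how a combined form is rejected. The combined pattern 'a++?' is compiled from a string inside eval so that the script itself still compiles:

```perl
use strict;
use warnings;

# A greedy a+ gives back one "a" so the trailing "a" can match.
my $greedy = "aaa" =~ /a+a/ ? 1 : 0;

# A possessive a++ keeps everything it matched, so the trailing "a" fails.
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;

# Stacking a further modifier on a possessive quantifier does not compile.
my $combined = 'a++?';
my $compiles = eval { qr/$combined/; 1 } ? 1 : 0;

print "greedy=$greedy possessive=$possessive combined compiles=$compiles\n";
```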
| |
COW was first introduced (and enabled by default) in 5.17.7.
It was disabled by default in 5.17.10, because it was thought to have too
many rough edges for the 5.18.0 release.
By re-enabling it now, early in the 5.19.x release cycle, hopefully it
will be ready for production use by 5.20.
This commit mainly reverts 9f351b45f4 and e1fd41328c (with modifications),
then updates perldelta.
| |
of optimize
| |
This reverts commit d78f32f607952d58a998c5b7554572320dc57b2a.
Since COW has now not been enabled by default for 5.18, revert the
documentation changes which say that $' etc no longer have a
performance penalty, etc.
| |
This commit deprecates having space/comments between the first two
characters of regular expression forms '(*VERB:ARG)' and '(?...)'.
That is, the '(' should be immediately followed by the '*' or '?'.
Previously, things like:
qr/((?# This is a comment in the middle of a token)?:foo)/
were silently accepted.
The problem is that during regex parsing, the '(' is seen, and the input
pointer advanced, skipping comments and, under /x, white space, without
regard to whether the left parenthesis is part of a bigger token or not.
S_reg() handles the parsing of what comes next in the input, and it
just assumes that the first character it sees is the one that
immediately followed the input parenthesis.
Because the parenthesis may or may not be part of a bigger token, and
given the current structure of the parsing code, I don't see an elegant
way to fix this. What I did is flag the single call to S_reg() where this
could be an issue, and have S_reg() check for adjacency if the
parenthesis is part of a bigger token, and if so, warn if not-adjacent.
| |
Also, changes a reference to the section into an actual link.
| |
We have not had a working modern Perl on EBCDIC for some years. When I
started out, comments and code led me to conclude erroneously that
natively it supported semantics for all 256 characters 0-255. It turns
out that I was wrong; it natively (at least on some platforms) has the
same rules (essentially none) for the characters which don't correspond
to ASCII ones, as the rules for these on ASCII platforms.
This commit is documentation only, mostly just removing the special
mentions of EBCDIC.
| |
This is a form of bracketed character class, and so should be documented
in the same pod.
| |
A compiled '(?[ ])' embedded in a larger one is unaffected by what
regex modifiers are in effect at the time of the compilation of the
outer one; it retains, going forward, the modifiers it had when it was
first encountered.
| |
This commit adds the capability for '(?[ ])' to contain interpolated
variables from other '(?[ ])' constructs. A set operation can thus be
built up from the composition of other ones, without having to worry
about precedence, etc.
Thanks to Aaron Crane for suggesting this.
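A minimal sketch of such a composition, assuming a perl with '(?[ ])' support. The variable names are illustrative, and the warning category is disabled because the construct was experimental when this was written:

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

# Two sets compiled on their own...
my $lower = qr/(?[ [a-z] ])/;
my $vowel = qr/(?[ [aeiou] ])/;

# ...then combined by interpolating them into a third set operation.
my $consonant = qr/(?[ $lower - $vowel ])/;

my @kept = grep { /\A$consonant\z/ } qw(a b e z);
print "consonants: @kept\n";
```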
| |
Thanks to Hugo van der Sanden for reviewing this new code.
| |
"<" isn't a metacharacter, therefore "\<" doesn't change its meaning.
"[" is a metacharacter, therefore "\[" does change its meaning.
| |
There are three pairs of characters that Perl recognizes as
metacharacters in regular expression patterns: {}, [], and (). These
can be used as well to delimit patterns, as in:
m{foo}
s(foo)(bar)
Since they are metacharacters, they have special meaning to regular
expression patterns, and it turns out that you can't turn off that
special meaning by the normal means of preceding them with a backslash,
if you use them, paired, within a pattern delimited by them. For
example, in
m{foo\{1,3\}}
the backslashes do not change the behavior, and this matches "f", "o"
followed by one to three more occurrences of "o".
Usages like this, where they are interpreted as metacharacters, are
exceedingly rare; we think there are none, for example, in all of CPAN.
Hence, this deprecation should affect very little code. It does give
notice, however, that any such code needs to change, which will in turn
allow us to change the behavior in future Perl versions so that the
backslashes do have an effect, and without fear that we are silently
breaking any existing code.
=head1 Performance Enhancements