path: root/pod/perlre.pod
Commit message | Author | Age | Files | Lines
* perlre: Nits, clarifications | Karl Williamson | 2016-03-07 | 1 | -4/+5
* standardize on "lookahead" and "lookaround" | Ed Avis | 2015-12-07 | 1 | -20/+20
    ...not the hyphenated form

    commit message by rjbs
* perlre: Nits | Karl Williamson | 2015-10-30 | 1 | -128/+146
    This mostly adds C<> formatting, but there are a few updates, clarifications, and grammar-type fixes.
* PATCH: [perl #126177] Document /(?n)/ | Karl Williamson | 2015-10-19 | 1 | -12/+22
    This adds /n to various places in perlre where it was omitted, and adds a heading to better structure the document, and a clarifying sentence.
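A minimal sketch of what the /n flag and its inline (?n) form do, assuming Perl 5.22 or later. The input string and variable names are illustrative only and do not come from the commit.

```perl
use strict;
use warnings;

# Assumes Perl 5.22 or later, where the /n flag is available.
my $date = '2015-10-19';   # illustrative input

# Without /n, plain parentheses capture.
my ($year) = $date =~ /^(\d+)-/;

# With /n, plain parentheses only group; a named group still captures.
my ($month, $first);
if ($date =~ /^(\d+)-(?<month>\d+)/n) {
    $month = $+{month};    # text matched by the named group
    $first = $1;           # the named group is the only numbered capture
}

# (?n) switches the same behavior on from inside the pattern,
# so a successful match leaves $1 undefined here.
my $inline_n = ('ab' =~ /(?n)(a)(b)/ and not defined $1) ? 1 : 0;
```

Named groups are the way to keep selected captures while suppressing the rest.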
* fix perl #126186 make all verbs allow an optional arg | Yves Orton | 2015-10-05 | 1 | -12/+11
    In perl #126186 it was pointed out we had started allowing name arguments for verbs where we did not document them to be supported, albeit in an inconsistent way.

    The previous patch cleaned up some of the cause of this, but it seems better to just generally allow the existing verbs to all support a mark name argument.

    So this patch reverses the effect of the previous patch, and makes all verbs, FAIL, ACCEPT, etc, allow an optional argument, and set REGERROR/REGMARK appropriately as well.
* remove deprecated /\C/ RE character class | David Mitchell | 2015-06-19 | 1 | -5/+0
    This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* \s matching VT is no longer experimental | Karl Williamson | 2015-02-21 | 1 | -2/+2
    This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -0/+12
    A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.

    This makes the implementation of \X go from being complicated to trivial.
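As a rough illustration of the relationship between \b{gcb} and \X described above, the following sketch (assuming Perl 5.22 or later; the sample string and variable names are illustrative) splits a string at grapheme cluster breaks and compares the result with matching \X repeatedly.

```perl
use strict;
use warnings;

# Assumes Perl 5.22 or later, where \b{gcb} is available.
# 'e' followed by COMBINING ACUTE ACCENT forms one cluster; 'a' is another.
my $str = "e\x{301}a";

# Split wherever there is a grapheme cluster break between characters.
my @clusters = split /\b{gcb}/, $str;

# \X matches one whole extended grapheme cluster at a time.
my @via_X = $str =~ /\X/g;

my $same = (@clusters == @via_X
            && !grep { $clusters[$_] ne $via_X[$_] } 0 .. $#clusters) ? 1 : 0;
```

Both approaches should yield the same two pieces: the accented "e" and the plain "a".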
* perlre: Nits | Karl Williamson | 2015-02-18 | 1 | -3/+3
* Add portability warning for re 'strict' | Karl Williamson | 2015-01-20 | 1 | -11/+19
    When a range in a bracketed character class has one end be specified as Unicode, the whole range is viewed as Unicode. Currently this is not warned about, though it is somewhat like mixing apples and oranges. This commit adds a warning, but only under "use re 'strict'", and it now documents the only-one-end behavior.
* Add 'strict' subpragma to 'use re' | Karl Williamson | 2015-01-13 | 1 | -0/+3
    This subpragma is to allow p5p to add warnings/errors for regex patterns without having to worry about backwards compatibility. And it allows users who want to have the latest checks on their code to do so. An experimental warning is raised by default when it is used, not because the subpragma might go away, but because what it catches is subject to change from release-to-release, and so the user is acknowledging that they waive the right to backwards compatibility. I will be working in the near term to make some changes to what is detected by this.

    Note that there is no indication in the pattern stringification that it was compiled under this. This means I didn't have to figure out how to stringify it. It is fine because using this doesn't affect what the pattern gets compiled into, if successful. And interpolating the stringified pattern under either strict or non-strict should both just work.
* Perldelta for /n regexp flag. Also ?: to C<?:> in perlre.pod. | Matthew Horsfall | 2014-12-31 | 1 | -1/+1
* perlre: Fix too long verbatim line | Karl Williamson | 2014-12-30 | 1 | -1/+2
* Add documentation for /n (non-capture) regexp flag. | Matthew Horsfall | 2014-12-30 | 1 | -1/+21
* Make /[\N{}-\N{}]/ match Unicodely on EBCDIC | Karl Williamson | 2014-11-24 | 1 | -2/+8
    This makes [\N{U+06}-\N{U+09}] match U+06, U+07, U+08, U+09 even on EBCDIC platforms, allowing one to write portable ranges. For 1047 EBCDIC this would match 0x2E, 0x2F, 0x16, and 0x05.

    Thanks to Yaroslave Kuzmin for finding a bug in an earlier incarnation of this patch.
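A small sketch of the portable range from the message above. It assumes an ASCII platform, where code points and native ordinals coincide; on EBCDIC the same pattern would be expected to select the native ordinals quoted in the commit. The variable names are illustrative.

```perl
use strict;
use warnings;

# The range is written with Unicode code points at both ends.
my $range = qr/[\N{U+06}-\N{U+09}]/;

# Collect the native ordinals (0..16) whose characters match the range.
# Assumption: this runs on an ASCII platform.
my @matched = grep { chr($_) =~ $range } 0 .. 0x10;
```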
* spelling: till -> until | Karen Etheridge | 2014-08-19 | 1 | -3/+3
* document what version \K was added in | Karen Etheridge | 2014-08-19 | 1 | -1/+2
* pod/perlre.pod: fix typo in example code | Aaron Crane | 2014-07-05 | 1 | -1/+1
* Deprecate unescaped literal "{" in regex patterns | Karl Williamson | 2014-06-12 | 1 | -14/+6
    This commit also causes escaped (by a backslash) "(", "[", and "{" to be considered literally. In the previous 2 Perl versions, the escaping was ignored, and a (default-on) deprecation warning was raised. Now that we have warned for 2 release cycles, we can change the meaning of escaping to actually do something.

    Warning when a literal left brace is not escaped by a backslash will allow us to eventually use this character in more contexts as being meta, allowing us to extend the language. For example, the lower limit of a quantifier could be omitted, and better error checking instituted, or things like \w could be followed by a {...} indicating some special word character, like \w{Greek} to restrict to just Greek word characters.

    We tried to do this in v5.16, and many CPAN modules changed to backslash their left braces at that time. However we had to back out that change before 5.16 shipped because it turned out that escaping a left brace in some contexts didn't work, namely when the brace would normally be a metacharacter (for example surrounding a quantifier), and the pattern delimiters were { }. Instead we raised the useless backslash warning mentioned above, which has now been there for the requisite 2 cycles.

    This patch partially reverts 2 patches. The first, e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted the deprecation of unescaped literal left brace. The other, 4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of the useless left-characters.

    Note that, as in the original attempt to deprecate, we don't raise a warning if the left brace is the first character in the pattern. This is because in that position it can't be a metacharacter, so we don't require any disambiguation, and we found that if we did raise an error, there were quite a few places where this occurred.
* /x in patterns now includes all \p{PatWS} | Karl Williamson | 2014-05-30 | 1 | -0/+15
    This brings Perl regular expressions more into conformance with Unicode. /x now accepts 5 additional characters as white space. Use of these characters as literals under /x has been deprecated since 5.18, so now we are free to change what they mean.

    This commit eliminates the static function that processes the old whitespace definition (and a generated macro that was used only for this), using the already existing one for the new definition. It refactors slightly the static function that skips comments to mesh better with the needs of its callers, and calls it in one place where before the code was essentially duplicated.

    p5p discussion starting in http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that the (?[ ]) comments should be terminated the same way as regular /x comments, and this was also done in this commit. No prior notice is necessary as this is an experimental feature.
* perlre: Clarify /x eol can't be escaped | Karl Williamson | 2014-05-29 | 1 | -0/+2
* perlre: Update obsolete example | Karl Williamson | 2014-05-08 | 1 | -4/+3
    /foo{4,3}/ now emits a message, contrary to what the pod claims. Use a different example that doesn't emit a message.
* Deprecate /\C/ | David Mitchell | 2014-03-26 | 1 | -1/+1
    For 5.20, just say it's deprecated. We'll add a warning in 5.22 and change its behaviour in 5.24.
* fix RT #121299 - Inconsistent behavior with backreferences nested inside subpattern references | Yves Orton | 2014-02-24 | 1 | -4/+11
    Match variables should be dynamically scoped during GOSUB and GOSTART. The caller's state should be inherited by the callee, but once the callee returns, the caller's state should be restored.

    This is different from EVAL, where the caller's and callee's state are expected to not be the same (although might be the same), and where the "reasonable" match semantics differ. Currently the following two one liners will produce different results:

        $ ./perl -Ilib -le'"<ab><>>" =~/ < (?: \1 | [ab]+ ) (>) (?0)? /x and print $&;'
        <ab><>>
        $ ./perl -Ilib -le'$qr= qr/ < (?: \1 | [ab]+ ) (>) (??{ $qr })? /x; "<ab><>>" =~ m/$qr/ and print $&;'
        <ab>

    While I think reasonable people could argue that we should special case things when we know that the return from (??{ ... }) is the same as the currently executing pattern I think explaining the difference would be harder than necessary.

    On the contrary making GOSUB/GOSTART exactly the same as EVAL, so that the match vars were totally independent seems to throw away an opportunity for much more powerful semantics than can be offered by EVAL.
* Change 'semantics' to 'rules' | Karl Williamson | 2014-02-20 | 1 | -3/+3
    The term 'semantics' in documentation when applied to character sets is changed to 'rules' as being a shorter less-jargony synonym in this case. This was discussed several releases ago, but I didn't get around to it.
* Work properly under UTF-8 LC_CTYPE locales | Karl Williamson | 2014-01-27 | 1 | -8/+15
    This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales.

    As a side effect it resolves [perl #56820].

    The basics are easy, but there were a lot of details, and one troublesome edge case discussed below.

    What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.

    More work had to be done for regular expressions. There are three cases.

    1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work.

    2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only characters above-Latin1 that match only themselves, that the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on.

    In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work.

    Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances.

    3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list.

    The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
* perlre: Expand and clarify /x and (?# comment) | Karl Williamson | 2013-10-30 | 1 | -14/+35
    This pod omitted some of the details about how comments in regexes work. I had to do some experimentation to find some of the answers. I believe this clarifies things as well.
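For readers unfamiliar with the two comment forms this commit documents, here is a short sketch. The patterns and variable names are illustrative and are not taken from perlre itself.

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace is ignored and '#' starts a comment
# that runs to the end of the line.
my $date_re = qr/
    (\d{4})     # year
    -
    (\d{2})     # month
    (?# this comment form works with or without x)
/x;
my ($y, $m) = '2013-10' =~ $date_re;

# To match a literal space or '#' under /x, use a bracketed class
# or a backslash.
my $spaced = qr/ foo [ ] bar \# baz /x;
my $hit  = ('foo bar#baz' =~ $spaced) ? 1 : 0;
my $miss = ('foobar#baz'  =~ $spaced) ? 1 : 0;
```

One caveat: comment text inside a pattern must not contain the pattern's closing delimiter, which is why the comments here avoid the slash character.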
* standardize perlre cross-refs to operator-based flags | Ricardo Signes | 2013-09-29 | 1 | -16/+20
* reword the description of what the /m flag does | Ricardo Signes | 2013-09-29 | 1 | -2/+2
* slightly clarify the meaning of $ in regex | Ricardo Signes | 2013-09-29 | 1 | -1/+2
    This patch was suggested in #116773.
* Document non-destructive substitution: the '/r' modifier. | James E Keenan | 2013-08-07 | 1 | -0/+9
    Per bug report filed by Jacinta Richardson++.

    For: RT #119151
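A brief sketch of non-destructive substitution with /r, assuming Perl 5.14 or later. The strings and variable names are illustrative.

```perl
use strict;
use warnings;

my $original = 'perlre.pod';

# With /r, s/// returns the modified copy and leaves its target alone.
my $renamed = $original =~ s/\.pod\z/.html/r;

# Because the result is a plain value, substitutions can be chained.
my $chained = $original =~ s/perl/Perl/r =~ s/\.pod\z//r;
```

After this runs, $original still holds 'perlre.pod', while $renamed and $chained hold the edited copies.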
* s/.(?=.\G)/X/g: refuse to go backwards | David Mitchell | 2013-07-28 | 1 | -0/+11
    On something like:

        $_ = "123456789";
        pos = 6;
        s/.(?=.\G)/X/g;

    each iteration could in theory start with pos one character to the left of the previous position, and with the substitution replacing bits that it has already replaced. Since that way madness lies, ban any attempt by s/// to substitute to the left of a previous position.

    To implement this, add a new flag to regexec(), REXEC_FAIL_ON_UNDERFLOW. This tells regexec() to return failure even if the match itself succeeded, but where the start of $& is before the passed stringarg point.

    This change caused one existing test to fail (which was added about a year ago):

        $_="abcdef";
        s/bc|(.)\G(.)/$1 ? "[$1-$2]" : "XX"/ge;
        print; # used to print "aXX[c-d][d-e][e-f]"; now prints "aXXdef"

    I think that that test relies on ambiguous behaviour, and that my change makes things saner.

    Note that s/// with \G is generally very under-tested.
* perlexperiment: (?{}) and (??{}) are not experimental | Ricardo Signes | 2013-06-23 | 1 | -12/+45
    ...but we need some more explanation of their limitations.

    This text was provided by Yves Orton on perl5-porters in message <CANgJU+UXO7tKZgOvbwufFxAjupOcKVPdDBNkRrT7DWKdv9tBgw@mail.gmail.com>
* perlexperiment: mark regexp backtracking verbs as accepted | Ricardo Signes | 2013-06-23 | 1 | -7/+0
* Possessive and non greedy quantifier modifiers are mutually exclusive | Yves Orton | 2013-06-13 | 1 | -1/+11
    When I added support for possessive modifiers it was possible to build perl so that they could be combined even if it made no sense to do so.

    Later on in relation to Perl #118375 I got confused and thought they were undocumented but legal.

    So to prevent further confusion, and since nobody has ever mentioned it since they were added, I am removing the unused conditional build logic, and clearly documenting why they aren't allowed.
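To make the distinction concrete, the sketch below contrasts greedy, possessive, and non-greedy quantifiers, then tries to compile a pattern that stacks both modifiers. The strings and variable names are illustrative, and the failed compile is the behavior expected on perls that include this change.

```perl
use strict;
use warnings;

my $text = 'aaa';

# Greedy: a+ takes all three characters, then gives one back so the
# final 'a' can match.
my $greedy = ($text =~ /^a+a\z/) ? 1 : 0;

# Possessive: a++ never gives anything back, so the final 'a' cannot match.
my $possessive = ($text =~ /^a++a\z/) ? 1 : 0;

# Non-greedy: a+? takes as little as possible.
my ($lazy) = $text =~ /^(a+?)/;

# Stacking both modifiers is rejected when the pattern is compiled.
# The pattern is kept in a string so the failure can be trapped at run time.
my $combined = 'a+?+';
my $compiles = eval { qr/$combined/; 1 } ? 1 : 0;
```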
* re-enable Copy-on-Write by default. | David Mitchell | 2013-05-26 | 1 | -11/+20
    COW was first introduced (and enabled by default) in 5.17.7. It was disabled by default in 5.17.10, because it was thought to have too many rough edges for the 5.18.0 release.

    By re-enabling it now, early in the 5.19.x release cycle, hopefully it will be ready for production use by 5.20.

    This commit mainly reverts 9f351b45f4 and e1fd41328c (with modifications), then updates perldelta.
* typo fix for re pod change use of optimise to be consistent with other uses of optimize | David Steinbrunner | 2013-05-25 | 1 | -2/+2
* Revert "Update docs to concur with $`,$&,$' changes" | David Mitchell | 2013-05-06 | 1 | -11/+8
    This reverts commit d78f32f607952d58a998c5b7554572320dc57b2a.

    Since COW has now not been enabled by default for 5.18, revert the documentation changes which say that $' etc no longer have a performance penalty, etc.
* Deprecate spaces/comments in some regex tokens | Karl Williamson | 2013-05-02 | 1 | -2/+2
    This commit deprecates having space/comments between the first two characters of regular expression forms '(*VERB:ARG)' and '(?...)'. That is, the '(' should be immediately followed by the '*' or '?'. Previously, things like:

        qr/((?# This is a comment in the middle of a token)?:foo)/

    were silently accepted.

    The problem is that during regex parsing, the '(' is seen, and the input pointer advanced, skipping comments and, under /x, white space, without regard to whether the left parenthesis is part of a bigger token or not. S_reg() handles the parsing of what comes next in the input, and it just assumes that the first character it sees is the one that immediately followed the input parenthesis.

    Since the parenthesis may or may not be a part of a bigger token, and the current structure of handling things, I don't see an elegant way to fix this. What I did is flag the single call to S_reg() where this could be an issue, and have S_reg check for adjacency if the parenthesis is part of a bigger token, and if so, warn if not-adjacent.
* pod/perlre: Italicize text to indicate non-literal | Karl Williamson | 2013-03-18 | 1 | -7/+9
    Also, changes a reference to the section into an actual link.
* EBCDIC has the Unicode bug too | Karl Williamson | 2013-03-11 | 1 | -10/+2
    We have not had a working modern Perl on EBCDIC for some years. When I started out, comments and code led me to conclude erroneously that natively it supported semantics for all 256 characters 0-255. It turns out that I was wrong; it natively (at least on some platforms) has the same rules (essentially none) for the characters which don't correspond to ASCII ones, as the rules for these on ASCII platforms.

    This commit is documentation only, mostly just removing the special mentions of EBCDIC.
* \N is no longer experimental | Karl Williamson | 2013-02-27 | 1 | -2/+1
* Move (?[]) doc to perlrecharclass | Karl Williamson | 2013-02-24 | 1 | -203/+6
    This is a form of bracketed character class, and so should be documented in the same pod.
* Document \s change for VT, commit 075b9d7d9a6d4473b240a047655e507c8baa6db3 | Karl Williamson | 2013-02-24 | 1 | -1/+2
* Add tests and clarify pod for (?[ ]) | Karl Williamson | 2013-02-04 | 1 | -3/+11
    A compiled '(?[ ])' embedded in a larger one is unaffected by what regex modifiers are in effect at the time of the compilation of the outer one; it retains, going forward, the modifiers it had when it was first encountered.
* Add interpolations to regex sets | Karl Williamson | 2013-02-03 | 1 | -0/+26
    This commit adds the capability for '(?[ ])' to contain interpolated variables from other '(?[ ])' constructs. A set operation can thus be built up from the composition of other ones, without having to worry about precedence, etc.

    Thanks to Aaron Crane for suggesting this.
* Incorporate code review feedback for (?[]) | Karl Williamson | 2013-02-03 | 1 | -35/+72
    Thanks to Hugo van der Sanden for reviewing this new code.
* perlre: Fix typo | Karl Williamson | 2013-01-19 | 1 | -1/+1
    "<" isn't a metacharacter, therefore "\<" doesn't change its meaning. "[" is a metacharacter, therefore "\[" does change its meaning.
* Deprecate certain rare uses of backslashes within regexes | Karl Williamson | 2013-01-19 | 1 | -11/+11
    There are three pairs of characters that Perl recognizes as metacharacters in regular expression patterns: {}, [], and (). These can be used as well to delimit patterns, as in:

        m{foo}
        s(foo)(bar)

    Since they are metacharacters, they have special meaning to regular expression patterns, and it turns out that you can't turn off that special meaning by the normal means of preceding them with a backslash, if you use them, paired, within a pattern delimited by them. For example, in

        m{foo\{1,3\}}

    the backslashes do not change the behavior, and this matches "f", "o" followed by one to three more occurrences of "o".

    Usages like this, where they are interpreted as metacharacters, are exceedingly rare; we think there are none, for example, in all of CPAN. Hence, this deprecation should affect very little code. It does give notice, however, that any such code needs to change, which will in turn allow us to change the behavior in future Perl versions so that the backslashes do have an effect, and without fear that we are silently breaking any existing code.

    =head1 Performance Enhancements
* perlre: fix typo | Aaron Crane | 2013-01-12 | 1 | -1/+1