t/re/regex_sets.t is actually handled by regexp.t, skipping all tests
that don't have a [bracketed character class]. Prior to this commit,
\[ and \c[ were thought to be such a class, when in fact they aren't.
This adds the capability to specify that a test is to be run only on an
ASCII platform, or only on an EBCDIC one.
This adds code to the processing of the tests in t/re/re_tests to
automatically convert most character constants from Unicode to the native
character set. This allows most tests in t/re/re_tests to be run on
both platforms without change. A later commit will add the capability
to skip individual tests on the wrong platform, so the few tests
that this commit doesn't handle can be accommodated.
As of 2db3e09128, attempts to load Unicode tables under miniperl croak
instead of failing silently.
While minitest passes all its tests when everything has been
built, it is sometimes useful to run it when nothing has been
built but miniperl (especially when one is working on low-level
stuff that breaks miniperl). Many tests fail if things have
not been built yet because miniperl can’t find modules like
re.pm. This patch fixes up some tests to find those modules
and changes _charnames.pm to load File::Spec only when it
needs it.
There are still many more failures, but I’ll leave the rest
for another time (or another hacker :-).
"aaabc" should match /a+?(*THEN)bc/ with "abc".
This adds testing of (?[ ]), using the same tests, t/re/re_tests,
as are used by many of the regular expression .t files. Basically, it
converts the [bracketed] character classes in these tests to the new
syntax and verifies that they work there.
Some tests won't work in one or the other, and the capability to skip
depending on the .t is added.
This is easier to read.
This reorders some if ... elsif blocks so that the skip is tested for and
done before actually trying the test. This only affected tests that
were supposed to generate compiler errors.
This .t works fine unless there are failures that it tries to output,
and the handle hasn't been opened using utf8. Because we aren't sure if
that operation works, just turn off warnings.
Change 1e9285c2ad54ae39 refactored Data::Dumper to load on miniperl.
t/re/regexp.t attempts to load Data::Dumper (in an eval) to display failure
output, including the failure of TODO tests. Hence Data::Dumper is now loaded
without error as part of minitest, so regexp.t then attempts to use
Data::Dumper to output better diagnostics. This fails (hard) because
Data::Dumper attempts to load Scalar::Util, which attempts to load B, which
bails out because this is miniperl.
It's not obvious that there's a 100% solution here that gets full-on
Data::Dumper functionality for miniperl.
Eliminate the declaration of $numtests, unused since commit 1a6108908b085da4.
Convert $iters and $OP to lexicals. Remove the vestigial logic for finding
t/re/re_tests - the MacOS classic style pathname is redundant now, and the
file can never be found at t/re/re_tests given that there is a chdir 't' in
the BEGIN block.
Some test platforms don't like unexpected output without the comment
prefix character.
t/re/regexp_qr_embed_thr.t is the only place that sets $::qr_embed_thr, so move
the special-case startup logic related to it from t/re/regexp.t to
t/re/regexp_qr_embed_thr.t. Use the skip_all_*() functions from test.pl.
Annotate that all tests for %+ and %- are to be skipped on miniperl.
# New Ticket Created by (Peter J. Acklam)
# Please include the string: [perl #81916]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81916 >
"make regen embed.fnc" needs to be run on this patch.
This patch fixes Bugs #56444 and #62056.
Hopefully we have finally gotten this right. The parser used to handle
all the escaped constants, expanding \x2e to its single-byte equivalent.
The problem is that for regexp patterns this is a '.', which is a
metacharacter and has special meaning that \x2e does not. So things
were changed so that the parser didn't expand things in patterns. But
this causes problems for \N{NAME} when the pattern doesn't get
evaluated until runtime, as for example when it has a scalar reference
in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was
in effect at the point in the parsing phase where this regex was
encountered, but we don't actually look at it until runtime, when
these bug reports show that it is gone. The solution is for the
tokenizer to parse \N{NAME}, but to compile it into an intermediate
value that won't ever be considered a metacharacter. We have chosen to
compile NAME to its equivalent code point value, and express it in the
already existing \N{U+...} form. This indicates to the regex compiler
that the original input was a named character and retains the value it
had at that point in the parse.

This means that \N{U+...} now always must imply Unicode semantics for
the string or pattern it appears in. Previously there was an
inconsistency, where effectively \N{NAME} implied Unicode semantics but
\N{U+...} did not necessarily. So now any string or pattern that has
either of these forms is utf8-upgraded.

A complication is that a charnames handler can return a sequence of
multiple characters instead of just one. To deal with this case, the
tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where
c1 etc. are the individual characters. Perhaps this will be made a
public interface someday, but I decided not to expose it externally as
far as possible for now, in case we find reason to change it. It is
possible to defeat this by passing it in a single-quoted string to the
regex compiler, so the documentation will be changed to discourage that.

A further complication is that \N can have an additional meaning: to
match a non-newline. This means that the two meanings have to be
disambiguated.

embed.fnc was changed to make public the function regcurly() in
regcomp.c so that it can be referred to in toke.c to see if the ... in
\N{...} is a legal quantifier like {2,}. This is used in the
disambiguation.

toke.c was changed to update some outdated relevant comments.
It now parses \N in patterns. If it determines that it isn't a named
sequence, it passes it through unchanged. This happens when there is no
brace after the \N, or no closing brace, or if the braces enclose a
legal quantifier. Previously there had been essentially no restriction
on what can come between the braces, so that a custom translator could
accept virtually anything. Now, legal quantifiers are assumed to mean
that the \N is a "match non-newline that quantity of times".

I removed the #ifdef'd-out code that had been left in in case pack U
reverted to its earlier behavior. I did this because it complicated
things, and because the change to pack U has been in long enough, and
shown that it is correct, that it is not likely to be reverted.

\N meaning a named character is handled differently depending on whether
this is a pattern or not. In all cases the output will be upgraded to
utf8, because a named character implies Unicode semantics. If not in a
pattern, the \N is parsed into a utf8 string, as before. Otherwise it
will be parsed into the intermediate \N{U+...} form. If the original
was already a valid \N{U+...} constant, it is passed through unchanged.

I now check that the sequence returned by the charnames handler is not
malformed, which was lacking before.

The code in regcomp.c which dealt with interfacing with the charnames
handler has been removed. All the values should be determined by the
time regcomp.c gets involved. The affected subroutine is necessarily
restructured.

An EXACT-type node is generated for the character sequence. Such a node
has a capacity of 255 bytes, so it is possible to overflow it. This
wasn't checked for before; now it is, a warning is issued, and the
overflowing characters are discarded.