| Commit message (Collapse) | Author | Age | Files | Lines |
|
Implement RT #121025 and add a "redundant" warning category that
currently only warns about redundant arguments to printf. Now similarly
to how we already warned about missing printf arguments:
    $ ./miniperl -Ilib -we 'printf "%s\n", qw()'
    Missing argument in printf at -e line 1.
We'll now warn about redundant printf arguments:
    $ ./miniperl -Ilib -we 'printf "%s\n", qw(x y)'
    Redundant argument in printf at -e line 1.
    x
The motivation for this is that I recently fixed an insidious
long-standing 6 year old bug in a codebase I maintain that came down to
an issue that would have been detected by this warning.
Things to note about this patch:
* It found some long-standing redundant printf arguments in our own
ExtUtils::MakeMaker code, which I submitted fixes for in
https://github.com/Perl-Toolchain-Gang/ExtUtils-MakeMaker/pull/84 and
https://github.com/Perl-Toolchain-Gang/ExtUtils-MakeMaker/pull/86;
those fixes were merged into blead in v5.19.8-265-gb33b7ab
* This warning correctly handles format parameter indexes (e.g. "%1$s")
for some value of correctly. See the comment in t/op/sprintf2.t for
an extensive discussion of how I've handled that.
* We do the correct thing in my opinion when a pattern has redundant
arguments *and* an invalid printf format. E.g. for the pattern "%A%s"
with one argument we'll just warn about an invalid format as before,
but with two arguments we'll warn about the invalid format *and* the
redundant argument.
This helps to disambiguate cases where the user just meant to write a
literal "%SOMETHING" vs. cases where they thought "%S" might be a valid
printf format.
* I originally wrote this while the 5.19 series was under way, but Dave
Mitchell has noted that a warning like this should go into blead
after 5.20 is released:
"[...] I think it should go into blead just after 5.20 is
released, rather than now; I think it'd going to kick up a lot of
dust and we'll want to give CPAN module owners maximum lead time
to fix up their code. For example, if its generating warnings in
cpan/ code in blead, then we need those module authors to fix
their code, produce stable new releases, pull them back into
blead, and let them bed in before we start pushing out 5.20 RC
candidates"
I agree, but we could have our cake and eat it too if "use warnings"
didn't turn this on but an explicit "use warnings qw(redundant)" did.
Then in 5.22 we could make "use warnings" also import the "redundant"
category, and in the meantime you could turn this on
explicitly.
There isn't an existing feature for adding that kind of warning in
the core, and my attempts at adding one failed; see the commentary in
RT #121025.
The warning needed to be added to a few places in sv.c because the "",
"%s" and "%-p" patterns all bypass the normal printf handling for
optimization purposes. The new warning works correctly on all of
them. See the tests in t/op/sprintf2.t.
It's worth mentioning that both Debian Clang 3.3-16 and GCC 4.8.2-12
warn about this in C code under -Wall:
    $ cat redundant.c
    #include <stdio.h>

    int main(void) {
        printf("%d\n", 123, 345);
        return 0;
    }
    $ clang -Wall -o redundant redundant.c
    redundant.c:4:25: warning: data argument not used by format string [-Wformat-extra-args]
        printf("%d\n", 123, 345);
               ~~~~~~       ^
    1 warning generated.
    $ gcc -Wall -o redundant redundant.c
    redundant.c: In function ‘main’:
    redundant.c:4:5: warning: too many arguments for format [-Wformat-extra-args]
        printf("%d\n", 123, 345);
        ^
So I'm not the first person to think that this might be generally
useful.
There are also other internal functions that could benefit from
missing/redundant warnings, e.g. pack. Neither of these currently
warns, but both should:
    $ perl -wE 'say pack "AA", qw(x y z)'
    xy
    $ perl -wE 'say pack "AAAA", qw(x y z)'
    xyz
I'll file a bug for that, and might take a stab at implementing it.
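
The missing/redundant check described above can be sketched in C as a comparison between the number of conversions in a format and the number of arguments supplied. This is only a minimal illustration, not the code in sv.c: `count_format_args` and `check_format_args` are hypothetical names, and the sketch deliberately ignores explicit parameter indexes ("%1$s"), "*" widths and the optimized "", "%s" and "%-p" paths that the real patch has to handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the arguments a simple format consumes.
 * Handles "%%" and plain conversions such as "%s" or "%-5d" only. */
static size_t count_format_args(const char *fmt)
{
    size_t n = 0;
    for (const char *p = fmt; *p; p++) {
        if (*p != '%')
            continue;
        p++;
        if (*p == '%')              /* literal percent sign, no argument */
            continue;
        /* skip flags, width and precision */
        while (*p && strchr("-+ #0123456789.", *p))
            p++;
        if (*p == '\0')
            break;                  /* dangling '%' at the end of the format */
        n++;                        /* one conversion consumes one argument */
    }
    return n;
}

/* Returns -1 when arguments are missing, 1 when some are redundant,
 * and 0 when the counts match. */
static int check_format_args(const char *fmt, size_t nargs)
{
    size_t wanted = count_format_args(fmt);
    if (nargs < wanted)
        return -1;
    if (nargs > wanted)
        return 1;
    return 0;
}
```

With this sketch, `check_format_args("%s\n", 0)` reports the "missing" case and `check_format_args("%s\n", 2)` the "redundant" one, mirroring the two miniperl examples.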
|
Ever since the warning for missing printf arguments was added in
v5.11.2-116-g7baa469 the "missing" warning category has been defined in
terms of the "uninitialized" category, so you couldn't turn it on/off
independently of that.
As discussed in RT #121025 I'm hacking on adding a new "redundant"
category for too many printf arguments. So, for consistency, add the
long-missing "missing" category in preparation for that.
|
from 41.6sec to 34.1sec
get_a2n() is called 181540 times by __uni_latin1() which in most cases
doesn't use the whole table. Other callers tend to use the whole
table, so make a copy.
|
took run time from 51.6 sec to 41.6 sec
get_I8_2_utf() is called 147000 times by cp_2_utfbytes() which
typically doesn't use the whole table, so return a reference to it
instead.
|
This commit also causes escaped (by a backslash) "(", "[", and "{" to be
considered literally. In the previous 2 Perl versions, the escaping was
ignored, and a (default-on) deprecation warning was raised. Now that we
have warned for 2 release cycles, we can change the meaning of escaping
to actually do something.
Warning when a literal left brace is not escaped by a backslash will
allow us to eventually use this character in more contexts as being
meta, allowing us to extend the language. For example, the lower limit
of a quantifier could be omitted, and better error checking instituted,
or things like \w could be followed by a {...} indicating some special
word character, like \w{Greek} to restrict to just Greek word
characters.
We tried to do this in v5.16, and many CPAN modules changed to backslash
their left braces at that time. However we had to back out that change
before 5.16 shipped because it turned out that escaping a left brace in
some contexts didn't work, namely when the brace would normally be a
metacharacter (for example surrounding a quantifier), and the pattern
delimiters were { }. Instead we raised the useless backslash warning
mentioned above, which has now been there for the requisite 2 cycles.
This patch partially reverts 2 patches. The first,
e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted
the deprecation of unescaped literal left brace. The other,
4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of
the useless left-characters.
Note that, as in the original attempt to deprecate, we don't raise a
warning if the left brace is the first character in the pattern. This
is because in that position it can't be a metacharacter, so we don't
require any disambiguation, and we found that if we did raise an error,
there were quite a few places where this occurred.
|
There are a few characters in the Latin1 range that can be folded to by
above-Latin1 characters. Some of these are folded to as part of a
single character fold, like KELVIN SIGN folds to 'k'. More are folded
to as part of a multi-character fold. Until this commit, there wasn't a
quick way to distinguish between the two classes. A couple of places
only want the single-character ones. It is more efficient to look for
just those than to include the multi-char ones which end up not doing
anything. This uses a bit in l1_char_class_tab.h to indicate those
characters that are in the desired class.
|
It's very rare actually for code to be presented with malformed UTF-8,
so give the compiler a hint about the likely branches.
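
As a rough illustration of such a hint, the sketch below marks the malformed case as the unlikely branch. The macro names `HINT_LIKELY` and `HINT_UNLIKELY` are stand-ins invented for this example, `__builtin_expect` is a GCC/Clang extension (hence the fallback), and the byte test assumes standard UTF-8 rather than Perl's extended encoding.

```c
#include <stddef.h>

/* Assumed stand-ins for branch-hint macros. */
#if defined(__GNUC__)
#  define HINT_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define HINT_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define HINT_LIKELY(x)   (x)
#  define HINT_UNLIKELY(x) (x)
#endif

/* Returns 1 if the buffer contains none of the bytes that can never
 * occur in standard UTF-8 (0xC0, 0xC1 and 0xF5..0xFF), 0 otherwise. */
static int has_no_forbidden_bytes(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* Malformed input is rare, so tell the compiler this branch
         * is the unlikely one. */
        if (HINT_UNLIKELY(s[i] == 0xC0 || s[i] == 0xC1 || s[i] >= 0xF5))
            return 0;
    }
    return 1;
}
```

The hint does not change the result of the function; it only influences how the compiler lays out the two paths.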
|
This allows for an efficient isUTF8_CHAR macro, which does its own
length checking, and uses the UTF8_INVARIANT macro for the first byte.
On EBCDIC systems this macro, which does a table lookup, is quite a bit
more efficient than all the branches that would normally have to be
done.
|
This adds a new macro generation option for inputs that are checked
elsewhere for buffer overflow, but otherwise need validity checks.
|
This commit indents code to properly align with the new block introduced
by the previous commit, and adds a comma to a comment
|
This causes the generated regcharclass.h to be valid on all
supported platforms
|
This is because a future commit will execute this code multiple times,
and the library file should only be read once.
|
This reverts commit c4c8e61502fd5289a080f20332c6e3f9f23ce6e2.
It turns out that this scheme to bootstrap regcharclass.h onto a machine
not running ASCII created too much manual labor getting things to work.
A better solution is to cross compile on an ASCII machine for the
target. Commit 6ff677df5d6fe0f52ca0b6736f8b5a46ac402943 created the
infrastructure to do that, and this commit starts the process of
changing regen/regcharclass.pl to use that.
|
This bit of code is not about just ASCII folds, so skip it when doing
just those.
|
Even though this file is not intended to be human consumable, it is
annoying to see #if ... #endif #if ...
where the #endif and #if could be consolidated.
It turns out not to be hard to do that.
|
The previous commit created a block around the code that is indented by
this commit.
|
This causes the generated charclass_invlists.h to be valid on all
supported platforms
|
The previous commit created a block around this code, which is now
appropriately indented
|
This causes the generated unicode_constants.h to be valid on all
supported platforms
|
The previous commit created a block around this code.
|
This causes the generated l1_char_class_tab.h to be valid on all
supported platforms
|
This causes the generated file ebcdic_tables.h to be #included by
utfebcdic.h instead of the hand-coded tables that were formerly there.
This makes it much easier to add or remove support for EBCDIC code
pages.
The UTF-EBCDIC-related tables for 037 and POSIX-BC are somewhat modified
from what they were before. They were changed by hand minimally a long
time ago to prevent segfaults, but in so doing, they lost an important
sorting characteristic of UTF-EBCDIC. The machine-generated versions
retain the sorting, while also not segfaulting. utfebcdic.h has
more detail about this, regarding tr16.
|
This script is to be used by others in regen/ to aid in handling
ASCII/EBCDIC items.
|
Indent code in block formed by the previous commit
|
This just changes the ordering so we don't do UTF-8 calculations unless
needed.
|
Since this program was written, the abbreviated names of the control
characters have become available from charnames::viacode(). We change
to use these instead of hard-coding them in.
At the same time, this shortens the names for some of the other
characters in cases where it is easy to read the short ones.
It also changes to use mnemonics instead of hard-coded ordinals, like
using ASCII instead of x < 128. This allows it to be run on an EBCDIC
platform.
|
This is a small improvement when a consecutive group of U8 code points
begins at 0 or ends at 255. A U8 value cannot go beyond these end
points, so there is no need to test for that end of the range. In
several places this causes a mask operation to not be generated.
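
The idea can be sketched as follows. The function names are hypothetical and this is not the code that regen/regcharclass.pl emits; it only shows why one comparison becomes redundant when a range touches 0 or 255 in an 8-bit type.

```c
#include <stdint.h>

/* General inclusive range test on an 8-bit value: two comparisons. */
static int in_range(uint8_t c, uint8_t lo, uint8_t hi)
{
    return c >= lo && c <= hi;
}

/* Range starting at 0: a uint8_t can never be below 0, so only the
 * upper bound needs testing. */
static int in_range_from_zero(uint8_t c, uint8_t hi)
{
    return c <= hi;
}

/* Range ending at 255: a uint8_t can never exceed 255, so only the
 * lower bound needs testing. */
static int in_range_to_max(uint8_t c, uint8_t lo)
{
    return c >= lo;
}
```

For every 8-bit input, `in_range_from_zero(c, hi)` agrees with `in_range(c, 0, hi)`, and `in_range_to_max(c, lo)` agrees with `in_range(c, lo, 255)`.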
|
Until this patch, this could happen if both 'safe' and 'fast' are
specified with a cp macro.
|
This brings Perl regular expressions more into conformance with Unicode.
/x now accepts 5 additional characters as white space. Use of these
characters as literals under /x has been deprecated since 5.18, so now
we are free to change what they mean.
This commit eliminates the static function that processes the old
whitespace definition (and a generated macro that was used only for
this), using the already existing one for the new definition. It
refactors slightly the static function that skips comments to mesh
better with the needs of its callers, and calls it in one place where
before the code was essentially duplicated.
p5p discussion starting in
http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that
the (?[ ]) comments should be terminated the same way as regular /x
comments, and this was also done in this commit. No prior notice is
necessary as this is an experimental feature.
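
The commit message does not list the five characters. Assuming they are the non-ASCII members of Unicode's Pattern_White_Space property (U+0085, U+200E, U+200F, U+2028 and U+2029), the new definition could be sketched like this; `is_pattern_white_space` is a hypothetical name, not the function used in regcomp.c.

```c
#include <stdint.h>

/* Sketch of a Pattern_White_Space test on a code point.
 * Assumption: the five additions are the last five entries below. */
static int is_pattern_white_space(uint32_t cp)
{
    return (cp >= 0x0009 && cp <= 0x000D)   /* TAB, LF, VT, FF, CR */
        || cp == 0x0020                     /* SPACE */
        || cp == 0x0085                     /* NEXT LINE */
        || cp == 0x200E                     /* LEFT-TO-RIGHT MARK */
        || cp == 0x200F                     /* RIGHT-TO-LEFT MARK */
        || cp == 0x2028                     /* LINE SEPARATOR */
        || cp == 0x2029;                    /* PARAGRAPH SEPARATOR */
}
```

Note that NO-BREAK SPACE (U+00A0) is not part of this set, so it stays a literal under this definition.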
|
This is currently allowed, but is non-graphic, and is indistinguishable
from a regular space. I was the one who initially allowed it, and did
so out of ignorance of the negative consequences of doing so. There is
no other precedent for including it.
|
overrun-local: Overrunning array PL_reg_intflags name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).
Needed compile-time limits for the PL_reg_intflags_name so that the
bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH
because the size of the name array is not visible during compile time
(only const char*[] is), so modified regcomp.pl to generate the size,
and made it visible only under DEBUGGING. Did extflags analogously
even though its size is currently exactly 32 already. The
sizeof(flags)*8 is extra paranoia for ILP64.
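
A minimal sketch of the bounded bit loop follows. The array `flag_names`, the constant `FLAG_NAMES_COUNT` (standing in for the size that regcomp.pl now generates) and the function `highest_flag_name` are all hypothetical; the point is only that the loop stops at whichever limit comes first, the number of names or the number of bits in the flags word.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name table; in the real code only the declaration
 * (const char *[]) is visible, so the size comes from a generated
 * constant rather than from sizeof. */
static const char *const flag_names[] = { "ALPHA", "BETA", "GAMMA" };
#define FLAG_NAMES_COUNT 3

/* Returns the name of the highest set bit that has a name, or NULL. */
static const char *highest_flag_name(unsigned long flags)
{
    const char *name = NULL;
    for (size_t bit = 0;
         bit < FLAG_NAMES_COUNT && bit < sizeof(flags) * 8;
         bit++)
    {
        if (flags & (1UL << bit))
            name = flag_names[bit];   /* bit is always a valid index here */
    }
    return name;
}
```

Without the `FLAG_NAMES_COUNT` bound, a set bit 31 would index far past the end of a three-element table, which is the kind of overrun reported above.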
|
(and of regen/warnings.pl)
|
we return, rather than print, the warnings, so we can potentially
futz around with the string and put it where we like without
having to worry about C<select>
|
...to limit the number of variables visible everywhere and
make it a bit easier to see what I am doing as I refactor
regen/warnings.pl
|
In doing an audit of regcomp.c, and experimenting using
Encode::_utf8_on(), I found this one instance of a regen/regcharclass.pl
macro that could read beyond the end of the string if given malformed
UTF-8. Hence we convert to use the 'safe' form. There are no other
uses of the non-safe version, so we don't need to generate them.
|
Having these unused macros around just clutters up the header file
|
It makes no sense to check for length safeness for the macros generated
by this program which take a single UV code point as a parameter. Prior
to this patch, it would skip trying to generate them if asked. But,
because of the way things are structured, that means that if you need
just this and the safe versions, you can't do it so easily. What this
commit does is generate the cp macro if requested even if the 'safe'
version of other macros are also requested.
|
For matches that can match more than a single code point, one should
always use a macro that makes sure that one doesn't read off the end of
the buffer. This is because the buffer might end with the first N
characters of a sequence with at least N+1 in it, and we don't want to
read that N+1 position in the buffer.
If this had been in place, buggy commit 3a8bbffbce would not have
happened.
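
The rule can be illustrated with a small bounds-checked matcher. `match_seq_safe` is a hypothetical function, not one of the generated macros; `avail` plays the role of `e - s`, the number of bytes left in the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Returns seq_len if the avail bytes at s begin with seq, 0 otherwise.
 * Never reads beyond s + avail, even when the buffer ends in the
 * middle of a multi-byte or multi-character sequence. */
static size_t match_seq_safe(const char *s, size_t avail,
                             const char *seq, size_t seq_len)
{
    if (avail < seq_len)        /* buffer ends mid-sequence: no match */
        return 0;
    return memcmp(s, seq, seq_len) == 0 ? seq_len : 0;
}
```

If the buffer holds only the first two bytes of a three-byte sequence, the length test fails before any comparison touches the third position.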
|
The macros generated by these options are not needed in the core;
generating them just clutters up the header file, and some will actually
be forbidden by the next commit.
|
My thinking was muddled when I made that commit, and this reverts the
essence of it. The theory was that since we have already processed the
regex pattern, we don't need to check it for malformedness, hence we
don't need the "safe" form of certain macros that check for and avoid
running off the end of the buffer. It is true that we don't have to
worry about malformedness indicating that the buffer is bigger than it
really is, but these macros can match up to three well-formed
characters, so we do have to make sure that all three are in the buffer,
and that the input isn't just the first two at the buffer's end.
This was caught by running valgrind.
|
Indent to account for new block added in the previous commit
|
This simplifies the macros generated which make sure there are no read
errors. Prior to this commit, the code generated looked like
    (e - s) > 3
    ? see if things of at most length 4 match
    : (e - s) > 2
      ? see if things of at most length 3 match
      : (e - s) > 1
        ? see if things of at most length 2 match
        : (e - s) > 0
          ? see if things of at most length 1 match
For things that are a single character, the ones greater than length 2
must be in UTF8, and their needed length can be determined by UTF8SKIP,
so we can get rid of most of the (e-s) tests.
This doesn't change the macros which can match multiple characters; that
is harder to do.
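
A rough sketch of the simplification: instead of a cascade of `(e - s)` tests, one comparison against the length announced by the first byte is enough for a single character. `utf8_skip` below is a hypothetical stand-in for UTF8SKIP that only covers standard UTF-8 sequences of up to 4 bytes, and `char_fits` is likewise an invented name.

```c
#include <stddef.h>

/* Length of a standard UTF-8 sequence as announced by its first byte.
 * Continuation bytes and invalid start bytes yield 1 so that a caller
 * always advances. */
static size_t utf8_skip(unsigned char first)
{
    if (first < 0x80)
        return 1;                       /* ASCII */
    if (first >= 0xC0 && first < 0xE0)
        return 2;
    if (first >= 0xE0 && first < 0xF0)
        return 3;
    if (first >= 0xF0 && first < 0xF8)
        return 4;
    return 1;
}

/* Nonzero when one whole character starting at s fits in the avail
 * bytes that remain (avail corresponds to e - s). */
static int char_fits(const unsigned char *s, size_t avail)
{
    return avail > 0 && utf8_skip(s[0]) <= avail;
}
```

Once `char_fits` succeeds, the bytes of that one character can be examined without further length tests.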
|
This adds a comment to the generated file that the macros are not to be
generally used.
|