path: root/regen
Commit message  [Author, Age, Files, Lines]
* Add a new warning about redundant printf arguments  [Ævar Arnfjörð Bjarmason, 2014-06-21, 1 file, -1/+2]
  Implement RT #121025 and add a "redundant" warning category that currently only warns about redundant arguments to printf. Now, similarly to how we already warned about missing printf arguments:

      $ ./miniperl -Ilib -we 'printf "%s\n", qw()'
      Missing argument in printf at -e line 1.

  We'll now warn about redundant printf arguments:

      $ ./miniperl -Ilib -we 'printf "%s\n", qw(x y)'
      Redundant argument in printf at -e line 1.
      x

  The motivation for this is that I recently fixed an insidious, long-standing, six-year-old bug in a codebase I maintain that came down to an issue that would have been detected by this warning.

  Things to note about this patch:

  * It found some long-standing redundant printf arguments in our own ExtUtils::MakeMaker code, which I submitted fixes for in https://github.com/Perl-Toolchain-Gang/ExtUtils-MakeMaker/pull/84 and https://github.com/Perl-Toolchain-Gang/ExtUtils-MakeMaker/pull/86. Those fixes were merged into blead in v5.19.8-265-gb33b7ab.

  * This warning correctly handles format parameter indexes (e.g. "%1$s"), for some value of "correctly". See the comment in t/op/sprintf2.t for an extensive discussion of how I've handled that.

  * We do the correct thing, in my opinion, when a pattern has redundant arguments *and* an invalid printf format. E.g. for the pattern "%A%s" with one argument we'll just warn about an invalid format as before, but with two arguments we'll warn about the invalid format *and* the redundant argument. This helps to disambiguate cases where the user just meant to write a literal "%SOMETHING" vs. cases where they thought "%S" might be a valid printf format.

  * I originally wrote this while the 5.19 series was under way, but Dave Mitchell has noted that a warning like this should go into blead after 5.20 is released: "[...]
    I think it should go into blead just after 5.20 is released, rather than now; I think it'd going to kick up a lot of dust and we'll want to give CPAN module owners maximum lead time to fix up their code. For example, if its generating warnings in cpan/ code in blead, then we need those module authors to fix their code, produce stable new releases, pull them back into blead, and let them bed in before we start pushing out 5.20 RC candidates"

    I agree, but we could have our cake and eat it too if "use warnings" didn't turn this on but an explicit "use warnings qw(redundant)" did. Then in 5.22 we could make "use warnings" also import the "redundant" category, and in the meantime you could turn this on explicitly. There isn't an existing feature for adding that kind of warning in the core. And my attempts at doing so failed, see commentary in RT #121025.

  The warning needed to be added to a few places in sv.c because the "", "%s" and "%-p" patterns all bypass the normal printf handling for optimization purposes. The new warning works correctly on all of them. See the tests in t/op/sprintf2.t.

  It's worth mentioning that both Debian Clang 3.3-16 and GCC 4.8.2-12 warn about this in C code under -Wall:

      $ cat redundant.c
      #include <stdio.h>

      int main(void) {
          printf("%d\n", 123, 345);
          return 0;
      }
      $ clang -Wall -o redundant redundant.c
      redundant.c:4:25: warning: data argument not used by format string [-Wformat-extra-args]
          printf("%d\n", 123, 345);
                 ~~~~~~       ^
      1 warning generated.
      $ gcc -Wall -o redundant redundant.c
      redundant.c: In function ‘main’:
      redundant.c:4:5: warning: too many arguments for format [-Wformat-extra-args]
          printf("%d\n", 123, 345);
          ^

  So I'm not the first person to think that this might be generally useful.

  There are also other internal functions that could benefit from missing/redundant warnings, e.g. pack.
  Neither of these currently warns, but both should:

      $ perl -wE 'say pack "AA", qw(x y z)'
      xy
      $ perl -wE 'say pack "AAAA", qw(x y z)'
      xyz

  I'll file a bug for that, and might take a stab at implementing it.
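The missing/redundant distinction described in this commit can be sketched in C as a comparison between the number of conversions in a format and the number of arguments supplied. This is only an illustrative sketch, not the code added to sv.c: the names `count_conversions` and `arg_balance` are hypothetical, and explicit parameter indexes such as "%1$s" and "*" widths are deliberately not handled.

```c
#include <stddef.h>

/* Count the conversions in a printf-style format, assuming each one
 * consumes exactly one argument. "%%" is a literal percent sign and
 * consumes nothing. Assumption of this sketch: explicit indexes
 * ("%1$s") and "*" widths do not occur in the format. */
static size_t count_conversions(const char *fmt)
{
    size_t n = 0;
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p != '%')
            continue;
        if (p[1] == '%') {   /* "%%": skip the second '%' as well */
            p++;
            continue;
        }
        if (p[1] == '\0')    /* lone trailing '%': not a conversion */
            break;
        n++;
    }
    return n;
}

/* Returns 1 if there are redundant arguments, -1 if arguments are
 * missing, and 0 if the counts agree. */
static int arg_balance(const char *fmt, size_t nargs)
{
    size_t want = count_conversions(fmt);
    if (nargs > want)
        return 1;
    if (nargs < want)
        return -1;
    return 0;
}
```

With the format "%s\n", two arguments give the "redundant" result and zero arguments give the "missing" result, mirroring the two miniperl invocations in the commit message.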
* Split up the fake "missing" warning category into an actual category  [Ævar Arnfjörð Bjarmason, 2014-06-21, 1 file, -1/+7]
  Ever since the warning for missing printf arguments was added in v5.11.2-116-g7baa469, the "missing" warning category has been defined in terms of the "uninitialized" category, so you couldn't turn it on/off independently of that.

  As discussed in RT #121025, I'm hacking on adding a new "redundant" category for too many printf arguments. So add the long-missing "missing" category in preparation for that, for consistency.
* avoid copying the whole EBCDIC mapping to the stack when calling get_a2n()  [Tony Cook, 2014-06-18, 6 files, -7/+8]
  Took run time from 41.6 sec to 34.1 sec.

  get_a2n() is called 181540 times by __uni_latin1(), which in most cases doesn't use the whole table. Other callers tend to use the whole table, so make a copy.
* avoid copying the whole map to the stack on each call to get_I8_2_utf()  [Tony Cook, 2014-06-18, 2 files, -5/+5]
  Took run time from 51.6 sec to 41.6 sec.

  get_I8_2_utf() is called 147000 times by cp_2_utfbytes(), which typically doesn't use the whole table, so return a reference instead.
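The two changes above are made in Perl (returning a reference rather than a list), but the same idea can be sketched in C: hand frequent callers a pointer to one shared table, and copy only for callers that need their own. This is a C analogue under stated assumptions, not the regen code itself; `get_map_ref`, `get_map_copy`, `map_one` and the identity table are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define MAP_SIZE 256

/* Hypothetical shared mapping table (identity map), built on first use. */
static unsigned char shared_map[MAP_SIZE];
static int shared_map_ready = 0;

/* Returns a pointer to the shared table; no per-call copy is made. */
static const unsigned char *get_map_ref(void)
{
    if (!shared_map_ready) {
        for (size_t i = 0; i < MAP_SIZE; i++)
            shared_map[i] = (unsigned char)i;
        shared_map_ready = 1;
    }
    return shared_map;
}

/* For callers that want a private table they can modify. */
static void get_map_copy(unsigned char out[MAP_SIZE])
{
    memcpy(out, get_map_ref(), MAP_SIZE);
}

/* A hot-path caller that looks up a single entry per call. */
static unsigned char map_one(unsigned char c)
{
    return get_map_ref()[c];
}
```

A caller like `map_one` that is invoked very many times touches one element per call, so avoiding the 256-element copy is where the saving would come from in this sketch.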
* Deprecate unescaped literal "{" in regex patterns  [Karl Williamson, 2014-06-12, 1 file, -8/+0]
  This commit also causes escaped (by a backslash) "(", "[", and "{" to be considered literally. In the previous 2 Perl versions, the escaping was ignored, and a (default-on) deprecation warning was raised. Now that we have warned for 2 release cycles, we can change the meaning of escaping to actually do something.

  Warning when a literal left brace is not escaped by a backslash will allow us to eventually use this character in more contexts as being meta, allowing us to extend the language. For example, the lower limit of a quantifier could be omitted, and better error checking instituted, or things like \w could be followed by a {...} indicating some special word character, like \w{Greek} to restrict to just Greek word characters.

  We tried to do this in v5.16, and many CPAN modules changed to backslash their left braces at that time. However, we had to back out that change before 5.16 shipped because it turned out that escaping a left brace in some contexts didn't work, namely when the brace would normally be a metacharacter (for example surrounding a quantifier), and the pattern delimiters were { }. Instead we raised the useless backslash warning mentioned above, which has now been there for the requisite 2 cycles.

  This patch partially reverts 2 patches. The first, e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted the deprecation of unescaped literal left brace. The other, 4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of the useless left-characters.

  Note that, as in the original attempt to deprecate, we don't raise a warning if the left brace is the first character in the pattern. This is because in that position it can't be a metacharacter, so we don't require any disambiguation, and we found that if we did raise an error, there were quite a few places where this occurred.
* add a warning for using the :win32 PerlIO layer  [Tony Cook, 2014-06-10, 1 file, -1/+3]
* regcomp.c: Skip work that is a no-op  [Karl Williamson, 2014-06-01, 1 file, -5/+36]
  There are a few characters in the Latin1 range that can be folded to by above-Latin1 characters. Some of these are folded to as part of a single character fold, like KELVIN SIGN folds to 'k'. More are folded to as part of a multi-character fold. Until this commit, there wasn't a quick way to distinguish between the two classes. A couple of places only want the single-character ones. It is more efficient to look for just those than to include the multi-char ones which end up not doing anything.

  This uses a bit in l1_char_class_tab.h to indicate those characters that are in the desired class.
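The "bit in a class table" approach can be sketched as a 256-entry table of flag bits indexed by the Latin1 character. The bit names and table contents here are hypothetical and much smaller than the real l1_char_class_tab.h; only the 'k' entry is taken from the commit message.

```c
#include <stdint.h>

/* Hypothetical class bits; the real generated table uses its own layout. */
#define CC_ALPHA              (1u << 0)
#define CC_SINGLE_FOLD_TARGET (1u << 1)  /* target of a single-char fold from above Latin1 */

/* Sparse illustrative table: entries not listed are 0. */
static const uint8_t class_tab[256] = {
    ['k'] = CC_ALPHA | CC_SINGLE_FOLD_TARGET, /* KELVIN SIGN folds to 'k' */
    ['t'] = CC_ALPHA,                         /* bit left unset in this sketch */
};

/* One table lookup and one mask answer the question. */
static int is_single_fold_target(unsigned char c)
{
    return (class_tab[c] & CC_SINGLE_FOLD_TARGET) != 0;
}
```

Callers that only care about single-character folds can test this one bit and skip characters that would lead to no work.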
* Add some (UN)?LIKELY() to UTF8 handling  [Karl Williamson, 2014-05-31, 1 file, -3/+4]
  It's very rare actually for code to be presented with malformed UTF-8, so give the compiler a hint about the likely branches.
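A minimal sketch of such branch hints, assuming a GCC- or Clang-style `__builtin_expect` and falling back to a plain expression elsewhere. The macro names `MY_LIKELY`/`MY_UNLIKELY` and the function `utf8_start_len` are hypothetical, and the byte ranges follow standard UTF-8 (RFC 3629) rather than Perl's extended encoding.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define MY_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define MY_LIKELY(x)   (x)
#  define MY_UNLIKELY(x) (x)
#endif

/* Returns the sequence length implied by a first byte, or 0 if the byte
 * cannot start a standard UTF-8 sequence. The malformed cases are marked
 * as unlikely so the compiler can lay out the common path first. */
static size_t utf8_start_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                 /* single-byte (ASCII) */
    if (MY_UNLIKELY(b < 0xC2))
        return 0;                 /* continuation byte or overlong start */
    if (b < 0xE0)
        return 2;
    if (b < 0xF0)
        return 3;
    if (MY_LIKELY(b < 0xF5))
        return 4;
    return 0;                     /* 0xF5..0xFF never start a sequence */
}
```

The hints do not change the result of the function; they only affect how the compiler arranges the branches.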
* utf8.h: Use new macro type from previous commit  [Karl Williamson, 2014-05-31, 1 file, -4/+8]
  This allows for an efficient isUTF8_CHAR macro, which does its own length checking, and uses the UTF8_INVARIANT macro for the first byte. On EBCDIC systems this macro, which does a table lookup, is quite a bit more efficient than all the branches that would normally have to be done.
* regen/regcharclass.pl: Add new macro type with intermed checking  [Karl Williamson, 2014-05-31, 1 file, -10/+38]
  This adds a new macro generation option for inputs that are checked elsewhere for buffer overflow, but otherwise need validity checks.
* regen/regcharclass.pl: Comment, white-space only  [Karl Williamson, 2014-05-31, 1 file, -18/+20]
  This commit indents code to properly align with the new block introduced by the previous commit, and adds a comma to a comment.
* regen/regcharclass.pl: Update to use EBCDIC utilities  [Karl Williamson, 2014-05-31, 1 file, -40/+46]
  This causes the generated regcharclass.h to be valid on all supported platforms.
* regen/regcharclass.pl: make a 'do' into a 'require'  [Karl Williamson, 2014-05-31, 1 file, -1/+1]
  This is because a future commit will execute this code multiple times, and the library file should only be read once.
* Revert "regen/regcharclass.pl: Make more EBCDIC-friendly"  [Karl Williamson, 2014-05-31, 1 file, -18/+2]
  This reverts commit c4c8e61502fd5289a080f20332c6e3f9f23ce6e2.

  It turns out that this scheme to bootstrap regcharclass.h onto a machine not running ASCII created too much manual labor getting things to work. A better solution is to cross compile on an ASCII machine for the target. Commit 6ff677df5d6fe0f52ca0b6736f8b5a46ac402943 created the infrastructure to do that, and this commit starts the process of changing regen/regcharclass.pl to use that.
* regen/regcharclass_multi_char_folds.pl: Don't do unnecessary work  [Karl Williamson, 2014-05-31, 1 file, -1/+1]
  This bit of code is not about just ASCII folds, so skip it when doing just those.
* regen/mk_invlists.pl: Remove unnecessary #if's  [Karl Williamson, 2014-05-31, 1 file, -2/+22]
  Even though this file is not intended to be human consumable, it is annoying to see

      #if ...
      #endif
      #if ...

  where the #endif and #if could be consolidated. It turns out not to be hard to do that.
* regen/mk_invlists.pl: White-space only  [Karl Williamson, 2014-05-31, 1 file, -113/+114]
  The previous commit created a block around the code that is indented by this commit.
* regen/mk_invlists.pl: Update to use EBCDIC utilities  [Karl Williamson, 2014-05-31, 1 file, -9/+34]
  This causes the generated charclass_invlists.h to be valid on all supported platforms.
* regen/unicode_constants.pl: White-space only  [Karl Williamson, 2014-05-31, 1 file, -72/+72]
  The previous commit created a block around this code, which is now appropriately indented.
* regen/unicode_constants.pl: Update to use EBCDIC utilities  [Karl Williamson, 2014-05-31, 1 file, -8/+19]
  This causes the generated unicode_constants.h to be valid on all supported platforms.
* regen/mk_PL_charclass.pl: White-space only  [Karl Williamson, 2014-05-31, 1 file, -51/+51]
  The previous commit created a block around this code.
* regen/mk_PL_charclass.pl: Update to use EBCDIC utilities  [Karl Williamson, 2014-05-31, 1 file, -1/+18]
  This causes the generated l1_char_class_tab.h to be valid on all supported platforms.
* Make many EBCDIC tables generated instead of hand-coded  [Karl Williamson, 2014-05-31, 1 file, -0/+204]
  This causes the generated file ebcdic_tables.h to be #included by utfebcdic.h instead of the hand-coded tables that were formerly there. This makes it much easier to add or remove support for EBCDIC code pages.

  The UTF-EBCDIC-related tables for 037 and POSIX-BC are somewhat modified from what they were before. They were changed by hand minimally a long time ago to prevent segfaults, but in so doing, they lost an important sorting characteristic of UTF-EBCDIC. The machine-generated versions retain the sorting, while also avoiding the segfaults. utfebcdic.h has more detail about this, regarding tr16.
* Add utilities for dealing with EBCDIC  [Karl Williamson, 2014-05-31, 1 file, -0/+289]
  This script is to be used by others in regen/ to aid in handling ASCII/EBCDIC items.
* regen/unicode_constants.pl: White-space only  [Karl Williamson, 2014-05-31, 1 file, -18/+18]
  Indent code in block formed by the previous commit.
* regen/unicode_constants.pl: Rearrange code order  [Karl Williamson, 2014-05-31, 1 file, -7/+11]
  This just changes the ordering so we don't do UTF-8 calculations unless needed.
* regen/mk_PL_charclass.pl: Rmv hard-coded char names  [Karl Williamson, 2014-05-31, 1 file, -83/+49]
  Since this program was written, the abbreviated names of the control characters have become available from charnames::viacode(). We change to use these instead of hard-coding them in. At the same time, this shortens the names for some of the other characters in cases where it is easy to read the short ones.

  It also changes to use mnemonics instead of hard-coded ordinals, like using ASCII instead of x < 128. This allows it to be run on an EBCDIC platform.
* regen/regcharclass.pl: Improve the generated code  [Karl Williamson, 2014-05-30, 1 file, -0/+15]
  This is a small improvement when a consecutive group of U8 code points begins at 0 or ends at 255. These end points cannot be exceeded, so there is no need to test for that end of the range. In several places this causes a mask operation to not be generated.
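The end-point observation can be illustrated with plain comparisons on an 8-bit value. This sketch only shows why one side of the range test becomes redundant; it does not reproduce the macros or mask operations that regen/regcharclass.pl actually generates, and the function names are hypothetical.

```c
#include <stdint.h>

/* General inclusive range test on a byte: two comparisons. */
static int in_range(uint8_t c, uint8_t lo, uint8_t hi)
{
    return c >= lo && c <= hi;
}

/* Range starting at 0: "c >= 0" is always true for an unsigned byte,
 * so only the upper end needs testing. */
static int in_range_from_zero(uint8_t c, uint8_t hi)
{
    return c <= hi;
}

/* Range ending at 255: "c <= 255" is always true for a byte,
 * so only the lower end needs testing. */
static int in_range_to_255(uint8_t c, uint8_t lo)
{
    return c >= lo;
}
```

For every byte value, the one-sided tests agree with the general two-sided test when the range really does begin at 0 or end at 255.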
* regen/regcharclass_multi_char_folds.pl: Add some comments  [Karl Williamson, 2014-05-30, 1 file, -6/+13]
* regen/regcharclass.pl: Don't generate macro twice  [Karl Williamson, 2014-05-30, 1 file, -0/+8]
  Until this patch, this could happen if both 'safe' and 'fast' are specified with a cp macro.
* /x in patterns now includes all \p{PatWS}  [Karl Williamson, 2014-05-30, 1 file, -1/+1]
  This brings Perl regular expressions more into conformance with Unicode. /x now accepts 5 additional characters as white space. Use of these characters as literals under /x has been deprecated since 5.18, so now we are free to change what they mean.

  This commit eliminates the static function that processes the old whitespace definition (and a generated macro that was used only for this), using the already existing one for the new definition. It refactors slightly the static function that skips comments to mesh better with the needs of its callers, and calls it in one place where before the code was essentially duplicated.

  p5p discussion starting in http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that the (?[ ]) comments should be terminated the same way as regular /x comments, and this was also done in this commit. No prior notice is necessary as this is an experimental feature.
* Deprecate NBSP in \N{...} names  [Karl Williamson, 2014-05-30, 1 file, -0/+3]
  This is currently allowed, but is non-graphic, and is indistinguishable from a regular space. I was the one who initially allowed it, and did so out of ignorance of the negative consequences of doing so. There is no other precedent for including it.
* add a 5.21 feature bundle  [Ricardo Signes, 2014-05-26, 1 file, -1/+3]
* Fix for Coverity perl5 CID 29034: Out-of-bounds read (OVERRUN)  [Jarkko Hietaniemi, 2014-04-30, 1 file, -1/+16]
  overrun-local: Overrunning array PL_reg_intflags_name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).

  Needed compile-time limits for the PL_reg_intflags_name so that the bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH because the size of the name array is not visible during compile time (only const char*[] is), so modified regcomp.pl to generate the size, and made it visible only under DEBUGGING.

  Did extflags analogously even though its size is currently exactly 32 already. The sizeof(flags)*8 is extra paranoia for ILP64.
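The shape of this fix can be sketched as a bit loop bounded by both a generated element count and the width of the flags word. The table, the `FLAG_NAMES_COUNT` macro and `lowest_flag_name` are hypothetical stand-ins; in the real code the count is emitted by regcomp.pl because the array's size is not visible where the loop runs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name table with fewer entries than the flags word has bits. */
static const char *const flag_names[] = { "FLAG_A", "FLAG_B", "FLAG_C" };

/* Stand-in for a generated constant published alongside the table. */
#define FLAG_NAMES_COUNT 3

/* Returns the name of the lowest set bit that has a name, or NULL.
 * The loop stops at the end of the name table and at the width of the
 * flags word, whichever comes first, so it never indexes past the array. */
static const char *lowest_flag_name(uint32_t flags)
{
    for (size_t bit = 0;
         bit < FLAG_NAMES_COUNT && bit < sizeof(flags) * 8;
         bit++)
    {
        if (flags & ((uint32_t)1 << bit))
            return flag_names[bit];
    }
    return NULL;
}
```

A set bit 31 with only three names yields NULL here rather than a read at index 31, which is the overrun Coverity reported.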
* warnings.pm: improve awkward sentence in pod  [Ricardo Signes, 2014-03-18, 1 file, -4/+4]
* bump the version of warnings.pm  [Ricardo Signes, 2014-03-18, 1 file, -2/+2]
  (and of regen/warnings.pl)
* remove references to perllexwarn from warnings.pm  [Ricardo Signes, 2014-03-18, 1 file, -6/+3]
* regen/warnings.pl no longer touches perllexwarn  [Ricardo Signes, 2014-03-18, 1 file, -19/+0]
* merge most of perllexwarn into warnings  [Ricardo Signes, 2014-03-18, 1 file, -7/+465]
* replace printTree with warningsTree  [Ricardo Signes, 2014-03-18, 1 file, -9/+12]
  we return, rather than print, the warnings, so we can potentially futz around with the string and put it where we like without having to worry about C<select>
* enclose warnings.h generation in a block  [Ricardo Signes, 2014-03-18, 1 file, -40/+45]
  ...to limit the number of variables visible everywhere and make it a bit easier to see what I am doing as I refactor regen/warnings.pl
* regcomp.c: Don't read past string-end  [Karl Williamson, 2014-03-12, 1 file, -1/+1]
  In doing an audit of regcomp.c, and experimenting using Encode::_utf8_on(), I found this one instance of a regen/regcharclass.pl macro that could read beyond the end of the string if given malformed UTF-8. Hence we convert to use the 'safe' form. There are no other uses of the non-safe version, so we don't need to generate them.
* regen/regcharclass.pl: Don't generate unused macros  [Karl Williamson, 2014-03-12, 1 file, -4/+4]
  Having these unused macros around just clutters up the header file.
* regen/regcharclass.pl: Generate correct macro instead of skipping  [Karl Williamson, 2014-03-12, 1 file, -2/+1]
  It makes no sense to check for length safeness for the macros generated by this program which take a single UV code point as a parameter. Prior to this patch, it would skip trying to generate them if asked. But, because of the way things are structured, that means that if you need just this and the safe versions, you can't do it so easily. What this commit does is generate the cp macro if requested even if the 'safe' version of other macros are also requested.
* regen/regcharclass.pl: Forbid non-safe macros for multi-char matches  [Karl Williamson, 2014-03-01, 1 file, -3/+13]
  For matches that can match more than a single code point, one should always use a macro that makes sure that one doesn't read off the end of the buffer. This is because the buffer might end with the first N characters of a sequence with at least N+1 in it, and we don't want to read that N+1 position in the buffer.

  If this had been in place, buggy commit 3a8bbffbce would not have happened.
* regen/regcharclass.pl: Don't generate unused macros  [Karl Williamson, 2014-03-01, 1 file, -3/+3]
  The macros generated by these options are not needed in the core; generating them just clutters up the header file, and some will actually be forbidden by the next commit.
* Revert most of 3a8bbffbce: Avoid unnecessary malformed checking  [Karl Williamson, 2014-03-01, 1 file, -2/+2]
  My thinking was muddled when I made that commit, and this reverts the essence of it. The theory was that since we have already processed the regex pattern, we don't need to check it for malformedness, hence we don't need the "safe" form of certain macros that check for and avoid running off the end of the buffer. It is true that we don't have to worry about malformedness indicating that the buffer is bigger than it really is, but these macros can match up to three well-formed characters, so we do have to make sure that all three are in the buffer, and that the input isn't just the first two at the buffer's end.

  This was caught by running valgrind.
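The buffer-end problem described here can be sketched with a bounds-checked matcher for a multi-character sequence. The function `match_at` is hypothetical and uses a plain ASCII sequence for readability; the generated 'safe' macros are structured differently, but guard against the same situation.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of seq if the bytes in [s, e) start with seq,
 * else 0. Assumes s <= e. The length check comes first, so a buffer
 * that ends with only a prefix of seq is never read past e. */
static size_t match_at(const char *s, const char *e, const char *seq)
{
    size_t len = strlen(seq);
    if ((size_t)(e - s) < len)
        return 0;               /* not enough bytes left in the buffer */
    return memcmp(s, seq, len) == 0 ? len : 0;
}
```

Given a buffer whose remaining bytes are "ff", a request to match the three-character sequence "ffi" returns 0 without touching the byte past the end, even though both available characters are well formed.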
* regen/regcharclass.pl: White-space; comment nits only  [Karl Williamson, 2014-03-01, 1 file, -8/+8]
  Indent to account for new block added in the previous commit.
* regen/regcharclass.pl: Simplify generated safe macros  [Karl Williamson, 2014-03-01, 1 file, -12/+117]
  This simplifies the macros generated which make sure there are no read errors. Prior to this commit, the code generated looked like

      (e - s) > 3
      ? see if things of at most length 4 match
      : (e - s) > 2
        ? see if things of at most length 3 match
        : (e - s) > 1
          ? see if things of at most length 2 match
          : (e - s) > 0
            ? see if things of at most length 1 match

  For things that are a single character, the ones greater than length 2 must be in UTF8, and their needed length can be determined by UTF8SKIP, so we can get rid of most of the (e-s) tests.

  This doesn't change the macros which can match multiple characters; that is harder to do.
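The simplification for single characters can be sketched as: derive the needed length from the first byte, do one length check, then compare bytes. This is a hand-written illustration rather than generated macro output; `skip_len` is a hypothetical stand-in for what UTF8SKIP provides, limited to standard UTF-8, and U+2028 LINE SEPARATOR (bytes E2 80 A8) is just an example target.

```c
#include <stddef.h>

/* Sequence length implied by a first byte, or 0 if it cannot start a
 * standard UTF-8 sequence (hypothetical stand-in for UTF8SKIP). */
static size_t skip_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC2) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;
}

/* True if [s, e) begins with U+2028 LINE SEPARATOR (E2 80 A8).
 * A single comparison against the needed length replaces the cascade
 * of (e - s) tests. */
static int is_line_sep_safe(const unsigned char *s, const unsigned char *e)
{
    if (s >= e)
        return 0;
    size_t need = skip_len(*s);
    if (need == 0 || (size_t)(e - s) < need)
        return 0;               /* malformed start or truncated character */
    return need == 3 && s[0] == 0xE2 && s[1] == 0x80 && s[2] == 0xA8;
}
```

When the buffer holds only the first two bytes of the character, the length check fails before `s[2]` is read.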
* regen/regcharclass.pl: Warn that macros are internal only  [Karl Williamson, 2014-03-01, 1 file, -1/+6]
  This adds a comment to the generated file that the macros are not to be generally used.