path: root/regen
Commit message    Author    Age    Files    Lines
* padcv op type    Father Chrysostomos    2012-09-15    1    -0/+2
|
* Increase $warnings::VERSION to 1.14    Father Chrysostomos    2012-09-14    1    -1/+1
|
* Stop lexical warnings from turning off deprecations    Father Chrysostomos    2012-09-14    1    -7/+15
|
| Some warnings, such as deprecation warnings, are on by default:
|
|     $ perl5.16.0 -e '$*'
|     $* is no longer supported at -e line 1.
|
| But turning *on* other warnings will turn them off:
|
|     $ perl5.16.0 -e 'use warnings "void"; $*'
|     Useless use of a variable in void context at -e line 1.
|
| Either all warnings in any given scope are controlled by lexical
| hints, or none of them are.
|
| When a single warnings category is turned on or off, if the warnings
| were controlled by $^W, then all warnings are first turned on
| lexically if $^W is 1 and all warnings are turned off lexically if
| $^W is 0.
|
| That has the unfortunate effect of turning off warnings when it was
| only requested that warnings be turned on.
|
| These categories contain default warnings:
|
|     ambiguous debugging deprecated inplace internal io malloc utf8
|     redefine syntax glob inplace overflow precedence prototype
|     threads misc
|
| Most also contain regular warnings, but these contain *only* default
| warnings:
|
|     debugging deprecated glob inplace malloc
|
| So we can treat $^W==0 as equivalent to
| qw(debugging deprecated glob inplace malloc) when enabling lexical
| warnings.
|
| While this means that some default warnings will still be turned off
| by ‘use warnings "void"’, it won’t be as many as before. So at least
| this is a step in the right direction.
|
| (The real solution, of course, is to allow each warning to be turned
| off or on on its own.)
* utf8.h: Use machine generated IS_UTF8_CHAR()    Karl Williamson    2012-09-13    1    -0/+16
|
| This takes the output of regen/regcharclass.pl for all the 1-4 byte
| UTF8-representations of Unicode code points, and replaces the current
| hand-rolled definition there. It does this only for ASCII platforms,
| leaving EBCDIC to be machine generated when run on such a platform.
|
| I would rather have both versions regenerated each time it is needed,
| to save an EBCDIC dependency, but it takes more than 10 minutes on my
| computer to process the 2 billion code points that have to be checked
| on ASCII platforms, and currently t/porting/regen.t runs this program
| every time; that slowdown would be unacceptable.
|
| If this is ever run under EBCDIC, the macro should be machine
| computed (very slowly). So, even though there is an EBCDIC
| dependency, it has essentially been solved.
* regen/regcharclass.pl: Add ability to restrict platforms    Karl Williamson    2012-09-13    1    -0/+9
|
| This adds the capability to skip definitions if they are for other
| than a desired platform.
* utf8.h: Remove some EBCDIC dependencies    Karl Williamson    2012-09-13    1    -0/+12
|
| regen/regcharclass.pl has been enhanced in previous commits so that
| it generates as good code as these hand-defined macro definitions for
| various UTF-8 constructs. And, it should be able to generate EBCDIC
| ones as well. By using its definitions, we can remove the EBCDIC
| dependencies for them. It is quite possible that the EBCDIC versions
| were wrong, since they have never been tested. Even if
| regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
| those in one place, than all the sundry definitions.
* regen/regcharclass.pl: Add optimization    Karl Williamson    2012-09-13    1    -5/+42
|
| On UTF-8 input known to be valid, continuation bytes must be in the
| range 0x80 .. 0xBF. Therefore, any tests for being within those
| bounds will always be true, and may be omitted.
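To illustrate the kind of test that can be dropped, here is a minimal C sketch. The class (U+0080 through U+00BF, encoded as lead byte 0xC2 plus one continuation byte) and both function names are hypothetical and are not taken from the generated headers. It relies only on the fact that a UTF-8 continuation byte has the bit pattern 10xxxxxx, that is, 0x80 through 0xBF.

```c
#include <stdint.h>

/* Hypothetical class: code points U+0080..U+00BF, whose UTF-8 form is
 * the lead byte 0xC2 followed by one continuation byte 0x80..0xBF. */

/* Full test, usable on input that has not been validated. */
static int is_demo_class_checked(const uint8_t *s)
{
    return s[0] == 0xC2 && s[1] >= 0x80 && s[1] <= 0xBF;
}

/* Test for input already known to be valid UTF-8: the bounds check on
 * the continuation byte is always true there, so it is omitted. */
static int is_demo_class_valid_input(const uint8_t *s)
{
    return s[0] == 0xC2;
}
```

On well-formed input the two functions agree for every sequence; they differ only on malformed bytes such as 0xC2 0x20, which the shorter form is not meant to see.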
* regen/regcharclass.pl: White-space only    Karl Williamson    2012-09-13    1    -7/+7
|
| Indent a newly-formed block
* regen/regcharclass.pl: Extend previously added optimization    Karl Williamson    2012-09-13    1    -13/+71
|
| A previous commit added an optimization to save a branch in the
| generated code at the expense of an extra mask when the input class
| has certain characteristics. This extends that to the case where
| sub-portions of the class have similar characteristics. The first
| optimization for the entire class is moved to right before the new
| loop that checks each range in it.
* regen/regcharclass.pl: Rmv always true components from gen'd macro    Karl Williamson    2012-09-13    1    -0/+3
|
| This adds a test and returns 1 from a subroutine if the condition
| will always match; and in the caller it adds a check for that, and
| omits the condition from the generated macro.
* regen/regcharclass.pl: Add an optimization    Karl Williamson    2012-09-13    1    -0/+126
|
| Branches can be eliminated from the macros that are generated here by
| using a mask in cases where applicable. This adds checking to see if
| this optimization is possible, and applies it if so.
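A minimal C sketch of the general idea, not the code regcharclass.pl emits: when the members of a class differ only in a fixed set of bits, one mask and one comparison can replace several comparisons. The two example classes and all function names below are hypothetical.

```c
#include <stdint.h>

/* Class {0x41, 0x61} ('A' and 'a' in ASCII): the members differ only in
 * bit 0x20, so that bit can be masked off before a single compare. */
static int is_A_or_a_branchy(uint8_t c)
{
    return c == 0x41 || c == 0x61;
}

static int is_A_or_a_masked(uint8_t c)
{
    return (c & 0xDF) == 0x41;
}

/* Class 0x84..0x87: the members differ only in the low two bits. */
static int in_range_branchy(uint8_t c)
{
    return c >= 0x84 && c <= 0x87;
}

static int in_range_masked(uint8_t c)
{
    return (c & 0xFC) == 0x84;
}
```

The masked forms trade a branch for an AND, which is the cost the later "Extend previously added optimization" commit describes as "an extra mask".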
* regen/regcharclass.pl: Rename a variable    Karl Williamson    2012-09-13    1    -3/+3
|
| I find it confusing that the array element name is the same as the
| full array
* regen/regcharclass.pl: Pass options deeper into call stack    Karl Williamson    2012-09-13    1    -8/+8
|
| This is to prepare for future commits which will act differently at
| the deep level depending on some of the options.
* Use macro not swash for utf8 quotemeta    Karl Williamson    2012-09-13    1    -0/+4
|
| The rules for deciding whether an above-Latin1 code point is to be
| quoted are now saved in a macro generated from a trie by
| regen/regcharclass.pl, and these are now used by pp.c to test these
| cases.
|
| This allows removal of a wrapper subroutine, and also there is no
| need for dynamic loading at run-time into a swash.
|
| This macro is about as big as I'm comfortable compiling in, but it
| saves the building of a hash that can grow over time, and removes a
| subroutine and interpreter variables. Indeed, performance benchmarks
| show that it is about the same speed as a hash, but it does not
| require having to load the rules in from disk the first time it is
| used.
* regen/regcharclass.pl: Add new output macro type    Karl Williamson    2012-09-13    1    -5/+10
|
| The new type 'high' is used on only above-Latin1 code points. It is
| designed for code that already knows the tested code point is not
| Latin1, and avoids unnecessary tests.
* regen/regcharclass.pl: Add documentation    Karl Williamson    2012-09-13    1    -32/+128
|
* regen/regcharclass.pl: Error check input better    Karl Williamson    2012-09-13    1    -3/+15
|
| This makes sure that the modifiers specified in the input are known
| to the program.
* regen/regcharclass.pl: Allow comments in input    Karl Williamson    2012-09-13    1    -8/+8
|
| Lines whose first non-blank character is a '#' are now considered to
| be comments, and ignored. This allows the moving of some lines that
| have been commented out back to after the __DATA__ where they really
| belong.
* regen/unicode_constants.pl: Add name parameter    Karl Williamson    2012-09-13    1    -3/+11
|
| A future commit will want to use the first surrogate code point's
| UTF-8 value. Add this to the generated macros, and give it a name,
| since there is no official one. The program has to be modified to
| cope with this.
* regexec.c: Use new macros instead of swashes    Karl Williamson    2012-09-13    1    -3/+0
|
| A previous commit has caused macros to be generated that will match
| Unicode code points of interest to the \X algorithm. This patch uses
| them. This speeds up modern Korean processing by 15%.
|
| Together with recent previous commits, the throughput of modern
| Korean under \X has more than doubled, and is now comparable to other
| languages (which have themselves increased by 35%).
* regen/regcharclass.pl: Generate macros for \X processing    Karl Williamson    2012-09-13    1    -0/+28
|
| \X is implemented in regexec.c as a complicated series of property
| look-ups. It turns out that many of those are for just a few code
| points, and so can be more efficiently implemented with a macro than
| a swash. This generates those.
* regen/regcharclass.pl: Change to work on an empty class    Karl Williamson    2012-09-13    1    -0/+1
|
| Future commits will add Unicode properties for this to generate
| macros, and some of them may be empty in some Unicode releases. This
| just causes such a generated macro to evaluate to 0.
* regen/regcharclass.pl: Fix bug for character '0'    Karl Williamson    2012-09-13    1    -1/+1
|
| The character '0' could be omitted from some generated macros,
| because the code tested the value of a hash entry (getting 0, which
| is false) instead of testing whether the entry exists.
* regen/regcharclass.pl: Work on EBCDIC platforms    Karl Williamson    2012-09-13    1    -8/+40
|
| This will now automatically generate macros for non-ASCII platforms,
| by mapping the Unicode input to native output.
|
| Doing this will allow several cases of EBCDIC dependencies in other
| code to be removed, and fixes the bug that this previously had with
| non-ASCII platforms.
* regen/regcharclass.pl: Remove Encode:: dependency    Karl Williamson    2012-09-13    1    -6/+3
|
| Newer options to unpack alleviate the need for Encode, and run
| faster.
* regen/regcharclass.pl: Handle ranges, \p{}    Karl Williamson    2012-09-13    1    -33/+34
|
| Instead of having to list all code points in a class, you can now use
| \p{} or a range.
|
| This changes some classes to use \p{}, so that any changes Unicode
| makes to the definitions don't have to be made manually here as well.
* Rename regen'd hdr to reflect expanded capabilities    Karl Williamson    2012-09-13    1    -4/+4
|
| The recently added utf8_strings.h has been expanded to include more
| than just strings. I'm renaming it to avoid confusion.
* regen/utf8_strings.pl: Add ability to get native charset    Karl Williamson    2012-09-13    1    -8/+25
|
| This adds a new capability to this program: to input a Unicode code
| point and create a macro that expands to the platform's native value
| for it. This will allow removal of a bunch of EBCDIC dependencies in
| the core.
* regen/utf8_strings.pl: Allow explicit default on input    Karl Williamson    2012-09-13    1    -6/+7
|
| An input line without a command is considered to be a request for the
| UTF-8 encoded string of the code point. This allows an explicit
| 'string' to be used.
* regen/utf8_strings.pl: Copy empty input lines to output    Karl Williamson    2012-09-13    1    -8/+17
|
| This allows the generated .h to look better.
* /regcharclass.pl, utf8_strings.pl: Add guard to .h    Karl Williamson    2012-09-13    2    -0/+10
|
| Future commits will have other headers #include the headers generated
| by these programs. It is best to guard against the preprocessor
| trying to process these twice.
* Add utility and .h for character's UTF-8    Karl Williamson    2012-08-27    1    -0/+108
|
| This adds regen/utf8_strings.pl, which takes Unicode characters and
| generates utf8_strings.h to contain #defines for macros that
| translate from the name to the UTF-8. This is needed in a few places,
| where previously things were manually figured out and hard-coded in.
| Doing this instead makes this easier, and removes EBCDIC
| dependencies/bugs, as the file would simply be regen'd on an EBCDIC
| platform.
* regen/regcharclass.pl: Comment out obsolete code    Karl Williamson    2012-08-27    1    -10/+12
|
| Tricky folds have been removed from the code, so the removed #defines
| are obsolete. I'm leaving this in, so it can conveniently be referred
| to in case we ever need it again.
* Remove boolkeys op    Father Chrysostomos    2012-08-26    2    -3/+1
|
* Banish boolkeys    Father Chrysostomos    2012-08-25    1    -1/+2
|
| Since 6ea72b3a1, rv2hv and padhv have had the ability to return
| booleans in scalar context, instead of bucket stats, if flagged the
| right way.
|
| sub { %hash || ... } is optimised to take advantage of this. If the
| || is in unknown context at compile time, the %hash is flagged as
| being maybe a true boolean. When flagged that way, it returns a
| boolean if block_gimme() returns G_VOID.
|
| If rv2hv and padhv can already do this, then we don’t need the
| boolkeys op any more. We can just flag the rv2hv to return a boolean.
| In all the cases where boolkeys was used, we know at compile time
| that it is true boolean context, so we add a new flag for that.
* Add caching to inversion list searches    Karl Williamson    2012-08-25    1    -1/+2
|
| Benchmarking showed some speed-up when the result of the previous
| search in an inversion list is cached, thus potentially avoiding a
| search in the next call. This adds a field to each inversion list
| which caches its previous search result.
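The caching idea can be sketched in C as follows. This is an illustration under stated assumptions, not perl's implementation: the `invlist` struct, its `prev` field, and both function names are hypothetical. An inversion list is taken here to be a sorted array in which each element starts a range, with ranges alternating between "in the set" (even index) and "not in the set" (odd index).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inversion list with a one-entry search cache. */
typedef struct {
    const uint32_t *starts; /* sorted range starts; even index = in set */
    size_t len;             /* number of elements in starts[] */
    ptrdiff_t prev;         /* index found by the previous search, or -1 */
} invlist;

/* Return the largest index i with starts[i] <= cp, or -1 if none. */
static ptrdiff_t invlist_search(invlist *il, uint32_t cp)
{
    ptrdiff_t p = il->prev;
    ptrdiff_t lo, hi, found;

    /* Cache hit: cp falls in the same range as the previous lookup. */
    if (p >= 0 && il->starts[p] <= cp
        && ((size_t)p + 1 == il->len || cp < il->starts[p + 1]))
        return p;

    /* Cache miss: ordinary binary search, then remember the result. */
    lo = 0;
    hi = (ptrdiff_t)il->len - 1;
    found = -1;
    while (lo <= hi) {
        ptrdiff_t mid = lo + (hi - lo) / 2;
        if (il->starts[mid] <= cp) {
            found = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    il->prev = found;
    return found;
}

static int invlist_contains(invlist *il, uint32_t cp)
{
    ptrdiff_t i = invlist_search(il, cp);
    return i >= 0 && (i % 2) == 0;
}
```

With the list `{0x30, 0x3A, 0x41, 0x5B}` (ASCII digits and uppercase letters), looking up '5' and then '7' performs one binary search; the second call is answered from the cached index. Text that stays within one script tends to hit the same range repeatedly, which is presumably where the measured speed-up comes from.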
* Add freed ops to PL_op_(name|desc)    Father Chrysostomos    2012-08-08    1    -0/+2
|
| This is useful for debugging, especially with -DT.
* mktables: Generate tables for chars that aren't in final fold pos    Karl Williamson    2012-08-02    2    -3/+10
|
| This starts with the existing table that mktables generates that
| lists all the characters in Unicode that occur in multi-character
| folds, and aren't in the final positions of any such fold.
|
| It generates data structures with this information to make it quickly
| available to code that wants to use it. Future commits will use these
| tables.
* regen/mk_invlists: Add mode to generate above-Latin1 only    Karl Williamson    2012-08-02    1    -3/+23
|
| This change adds the ability to specify that an output inversion list
| is to contain only those code points that are above Latin-1.
| Typically, the Latin-1 ones will be accessed from some other means.
* regen/mk_PL_charclass.pl: Remove obsolete code    Karl Williamson    2012-08-02    1    -2/+0
|
| Octals are no longer checked via this mechanism.
* Flatten vstrings modified in place    Father Chrysostomos    2012-07-27    1    -2/+1
|
| A substitution forces its target to a string upon successful
| substitution, even if the substitution did nothing:
|
|     $ ./perl -Ilib -le '$a = *f; $a =~ s/f/f/; print ref \$a'
|     SCALAR
|
| Notice that $a is no longer a glob after s///.
|
| But vstrings are different:
|
|     $ ./perl -Ilib -le '$a = v102; $a =~ s/f/f/; print ref \$a'
|     VSTRING
|
| I fixed this in 5.16 (1e6bda93) for those cases where the vstring
| ends up with a value that doesn’t correspond to the actual string:
|
|     $ ./perl -Ilib -le '$a = v102; $a =~ s/f/o/; print ref \$a'
|     SCALAR
|
| It works through vstring set-magic, which does the check and removes
| the magic if it doesn’t match. I did it that way because I couldn’t
| think of any other way to fix bug #29070, and I didn’t realise at the
| time that I hadn’t fixed all the bugs.
|
| By making SvTHINKFIRST true on a vstring, we force it through
| sv_force_normal before any in-place string operations. We can also
| make sv_force_normal handle vstrings as well. This fixes all the
| lingering-vstring-magic bugs in just two lines, making the vstring
| set-magic (which is also slow) redundant. It also allows the special
| case in sv_setsv_flags to be removed.
|
| Or at least that was what I had hoped. It turns out that pp_subst
| twists and turns in tortuous ways, and needs special treatment for
| things like this.
|
| And the do_trans functions weren’t checking SvTHINKFIRST when
| arguably they should have.
|
| I tweaked sv_2pv{utf8,byte} to avoid copying magic variables that do
| not need copying.
* Merge ck_trunc and ck_chdir    Father Chrysostomos    2012-07-25    1    -1/+1
|
| ck_chdir, added in 2006 (d4ac975e), duplicates ck_trunc, added in
| 1993 (79072805), except for a null op check which is harmless when
| applied to chdir.
* handy.h: Free up bits in PL_charclass[]    Karl Williamson    2012-07-24    1    -28/+15
|
| This array is a bit map containing the Posix and similar character
| classes for the first 256 code points. Prior to this commit many
| character classes were represented by two bits, one for characters
| that are in it over the full Latin-1 range, and one for just the
| ASCII characters that are in it. The number of bits in use was
| approaching the 32-bit limit available without playing games.
|
| This commit takes advantage of a recent commit that adds a bit to the
| table for all the ASCII characters, and the fact that the ASCII
| characters in a character class are a subset of the full Latin1
| range. So, a character is an ASCII-range character with the given
| character class iff both the full-range character class bit and the
| ASCII bit are set.
|
| A new internal macro is created to generate code to determine if a
| character is an ASCII range character with the given class. It's not
| clear if the generated code is faster or slower than the full range
| version.
|
| The result is that nearly half the bits are freed up, as the ones for
| the ASCII-range are now redundant.
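A minimal C sketch of the bit layout described above. The table contents, shift numbers, and macro names are all hypothetical stand-ins, not the actual handy.h definitions. Each class keeps a single full-range bit, and the ASCII-only variant is derived by also requiring the shared ASCII bit. The class constants are shift numbers, with the shifting done inside the base macro, as the "Move bit shifting into base macro" commit describes.

```c
#include <stdint.h>

/* Hypothetical shift numbers; the shift itself happens in CC_BIT(). */
#define CC_ALPHA_ 0
#define CC_ASCII_ 1

#define CC_BIT(shift) (1U << (shift))

/* Full Latin-1 range test: is the class bit set? */
#define generic_isCC(tbl, c, shift) \
    (((tbl)[(uint8_t)(c)] & CC_BIT(shift)) != 0)

/* ASCII-range test: the class bit AND the shared ASCII bit must be set. */
#define generic_isCC_A(tbl, c, shift)                              \
    (((tbl)[(uint8_t)(c)] & (CC_BIT(shift) | CC_BIT(CC_ASCII_)))   \
        == (CC_BIT(shift) | CC_BIT(CC_ASCII_)))

/* Fill a demo table; assumes an ASCII execution character set. */
static void build_demo_table(uint32_t tbl[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        tbl[c] = 0;
        if (c < 128)
            tbl[c] |= CC_BIT(CC_ASCII_);
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            tbl[c] |= CC_BIT(CC_ALPHA_);
    }
    tbl[0xE9] |= CC_BIT(CC_ALPHA_); /* U+00E9, alphabetic but not ASCII */
}
```

With this layout, 'a' passes both tests, while 0xE9 passes only the full-range one, so no separate "ASCII alpha" bit is needed.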
* handy.h: Move bit shifting into base macro    Karl Williamson    2012-07-24    1    -2/+2
|
| This changes the #defines to be just the shift number, while doing
| the shifting in the macro that the number is passed to. This will
| prove useful in future commits.
* handy.h: l1_charclass.h: Add bit for matching ASCII    Karl Williamson    2012-07-24    1    -2/+3
|
| This does not replace the isASCII macro definition, as I think the
| current one is more efficient than this one provides. But future
| commits will rely on all the named character classes (e.g.,
| /[[:ascii:]]/) having a bit, and this is the only one missing.
* regen/regcomp.pl: Allow ';' in comments    Karl Williamson    2012-07-24    1    -1/+1
|
| If a comment contained a semi-colon, the regular expression's greedy
| quantifier would think the portion of the comment before it was part
| of the data to be processed.
* Fixup indentation    Jan Dubois    2012-07-18    1    -6/+9
|
* Adding support for Visual C's __declspec(noreturn) function declarations to perl    Daniel Dragan    2012-07-18    1    -2/+5
|
| This will reduce the machine code size on Visual C Perl, by removing
| C stack cleanup opcodes and possible jmp opcodes after croak() and
| similar functions. Perl's existing __attribute__noreturn__ macro (and
| therefore GCC's __attribute__((noreturn))) is fundamentally
| incompatible with MS's implementation for noreturn functions.
| win32.h already has _MSC_VER aware code blocks, so adding more isn't
| a problem.
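As a rough illustration of why one macro cannot serve both compilers, the C sketch below uses two hypothetical macros, one expanding before the declaration and one after it. It assumes the incompatibility is one of placement: Visual C's `__declspec(noreturn)` has to precede the declaration, whereas the GCC attribute is written here after the prototype. The `DEMO_*` names, `demo_croak`, and `checked_div` are inventions for this example and are not perl's macros or functions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical macros: exactly one of the pair is non-empty. */
#if defined(_MSC_VER)
#  define DEMO_NORETURN_PREFIX __declspec(noreturn)
#  define DEMO_NORETURN_SUFFIX
#elif defined(__GNUC__)
#  define DEMO_NORETURN_PREFIX
#  define DEMO_NORETURN_SUFFIX __attribute__((noreturn))
#else
#  define DEMO_NORETURN_PREFIX
#  define DEMO_NORETURN_SUFFIX
#endif

/* Declaration carrying the annotation in whichever slot applies. */
DEMO_NORETURN_PREFIX void demo_croak(const char *msg) DEMO_NORETURN_SUFFIX;

void demo_croak(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    exit(EXIT_FAILURE);
}

/* The compiler may omit cleanup and jump code after the call below,
 * because it knows control never returns from demo_croak(). */
static int checked_div(int a, int b)
{
    if (b == 0)
        demo_croak("division by zero");
    return a / b;
}
```

On a compiler that defines neither `_MSC_VER` nor `__GNUC__`, both macros expand to nothing and the code still compiles, only without the size benefit.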
* [perl #113470] Constant folding for pack    Father Chrysostomos    2012-07-13    1    -1/+1
|
| This takes the pessimistic approach of skipping it for any first
| argument that is not a plain non-magical PV, just in case there is a
| 'p' or 'P' in the stringified form.
|
| Otherwise it scans the PV for 'p' or 'P' and skips the folding if
| either is present.
|
| Then it falls through to the usual op-filtering logic.
|
| I nearly made ‘pack;’ crash, so I added a test to bproto.t.
* mg_vtable.pl: Mention all generated files    Father Chrysostomos    2012-07-12    1    -0/+2
|