path: root/regen
Commit message (Author, Age, Files, Lines)
* bump feature.pm $VERSION (David Mitchell, 2016-07-17, 1 file, -1/+1)
|
* Update docs for declared_refs (Father Chrysostomos, 2016-07-17, 1 file, -0/+16)
|
* Add declared_refs feature feature (Father Chrysostomos, 2016-07-17, 1 file, -0/+1)
|
* Add experimental::declared_refs warn categ (Father Chrysostomos, 2016-07-17, 1 file, -0/+2)
|
* Get regen to work before 5.10 (Father Chrysostomos, 2016-07-16, 1 file, -2/+3)
|     Since it uses the system perl, it’s useful to keep it working with earlier versions.
* [perl #128597] Crash from gp_free/ckWARN_d (Father Chrysostomos, 2016-07-11, 1 file, -2/+4)
|     See the explanation in the test added and in the RT ticket. The solution is to make the warn macros check that PL_curcop is non-null.
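The guard described in this commit can be pictured with a minimal C sketch. This is not the macro from the Perl source: `cop_sketch`, `cur_cop` (standing in for PL_curcop) and `warn_enabled` are hypothetical names, and the value returned when there is no current cop is left to the caller because the commit message does not say what the real macros do in that case.

```c
#include <stddef.h>

/* Hypothetical stand-in for the structure that carries warning bits. */
struct cop_sketch { unsigned warn_bits; };

/* Plays the role of PL_curcop: may legitimately be NULL. */
static struct cop_sketch *cur_cop = NULL;

/* Returns nonzero if the category is enabled. When there is no current
 * cop, the pointer is never dereferenced and the caller-supplied
 * fallback is returned instead. */
static int warn_enabled(unsigned category_bit, int fallback_when_no_cop)
{
    if (cur_cop == NULL)
        return fallback_when_no_cop;
    return (cur_cop->warn_bits & category_bit) != 0;
}
```

The point of the sketch is only the ordering: the null test comes before any access through the pointer.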
* Prepare for Unicode 9.0 (Karl Williamson, 2016-06-21, 1 file, -61/+172)
|     The major code changes needed to support Unicode 9.0 are due to changes in the boundary (break) rules, for things like \b{lb}, \b{wb}. regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see if a given point in the target string is a break or not, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break, and whose column corresponds to the one after. Mostly this fully determines the answer, but for some cases, extra context is required, and the array entry indicates this, and there has to be specially crafted code in regexec.c to handle each such possibility. When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and regexec.c has to be changed to handle any changes to the custom code.
|
|     Unfortunately this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along, which need new rules. Sometimes it is because they realized the previous version didn't work as well as it could. An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between each pair, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emojis. We can expect further changes to these in the next Unicode releases.
|
|     \b{gcb}, for the first time, now depends on context (in rarely encountered cases, like RI's), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.
|
|     Some years ago I revamped mktables in part to try to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included. Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is there may not have been a paradigm, just something they did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong.
|
|     Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, like CJK UNIFIED IDEOGRAPH-3400 has always done. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with these automatically without manual intervention.
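The row/column lookup with an "extra context needed" entry, and the Regional Indicator pairing rule, can be sketched in C as follows. This is a toy under stated assumptions, not the code in regexec.c: the four classes, the `break_table` contents, and the names `bclass`, `verdict` and `is_break` are all hypothetical, and the real generated tables are far larger.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced set of character classes. */
enum bclass { BC_OTHER, BC_CR, BC_LF, BC_RI, BC_COUNT };

/* A table entry is either a final answer or a request for more context. */
enum verdict { NO_BREAK, BREAK, NEEDS_RI_CONTEXT };

/* Row = class before the candidate position, column = class after it. */
static const unsigned char break_table[BC_COUNT][BC_COUNT] = {
    /*            OTHER  CR     LF        RI               */
    /* OTHER */ { BREAK, BREAK, BREAK,    BREAK            },
    /* CR    */ { BREAK, BREAK, NO_BREAK, BREAK            },
    /* LF    */ { BREAK, BREAK, BREAK,    BREAK            },
    /* RI    */ { BREAK, BREAK, BREAK,    NEEDS_RI_CONTEXT },
};

/* Is there a break between classes[pos - 1] and classes[pos]?
 * Positions at either edge are simply reported as breaks here. */
static int is_break(const enum bclass *classes, size_t len, size_t pos)
{
    unsigned char v;
    size_t run = 0, i;

    if (pos == 0 || pos >= len)
        return 1;

    v = break_table[classes[pos - 1]][classes[pos]];
    if (v != NEEDS_RI_CONTEXT)
        return v == BREAK;

    /* Context case: count the run of RIs ending just before pos.
     * An even count means complete pairs precede pos, so a break is
     * allowed; an odd count means pos would split a pair. */
    for (i = pos; i > 0 && classes[i - 1] == BC_RI; i--)
        run++;
    return run % 2 == 0;
}
```

For a run of four RIs this reports a break only in the middle, between the two pairs, which is the behaviour the message describes for Unicode 9.0.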
* Sort @def before generating $warnings::DEFAULT. (Matthew Horsfall, 2016-06-21, 1 file, -2/+2)
|     This makes the comment easier to read.
* Another op description correction: & -> &. (Father Chrysostomos, 2016-05-20, 1 file, -3/+3)
|     The string bitwise ops have dots in them, which should be included in the op descriptions.
* Correct ‘bitiwse’ in two op descriptions (Father Chrysostomos, 2016-05-20, 1 file, -2/+2)
|     Oops!
* Allow assignment to &CORE::keys() (Father Chrysostomos, 2016-05-20, 1 file, -1/+1)
|
* Allow &CORE::foo() with hash functions (Father Chrysostomos, 2016-05-20, 1 file, -1/+5)
|     &CORE::keys does not yet work as an lvalue. (I’m not sure how to make that work.)
* Add avhvswitch op (Father Chrysostomos, 2016-05-20, 1 file, -0/+1)
|     &CORE::keys() et al. will use this to switch between keys and akeys depending on the argument type.
* regen/opcodes: Re-order aeach, akeys, and avalues (Father Chrysostomos, 2016-05-20, 1 file, -1/+1)
|     In a forthcoming commit, I will need them to be in the same order as the corresponding hash functions.
* Give feature.pm the concept of no-op features (Father Chrysostomos, 2016-05-20, 1 file, -2/+9)
|
* Remove @experimental from regen/feature.pl (Father Chrysostomos, 2016-05-20, 1 file, -7/+0)
|     Originally, we were going to have feature.pm warning when enabling an experimental feature. That changed, though, when we introduced the :all tag, because it is unkind for :all to warn. So in v5.17.6-49-g64fbf0d we started warning when a feature is used, not enabled. It does not appear that that will ever change, so we might as well remove the dead code (and comments) from regen/feature.pl.
* Increase $feature::VERSION to 1.44 (Father Chrysostomos, 2016-05-20, 1 file, -1/+1)
|
* Update feature.pm docs for lex sub acceptance (Father Chrysostomos, 2016-05-20, 1 file, -8/+12)
|
* [perl #128187] Forbid keys @_ in assigned lv sub (Father Chrysostomos, 2016-05-20, 1 file, -1/+2)
|     This is a continuation of this commit’s great grandparent, extending the error to arrays.
* better glibc i_modulo bug handling (jimc, 2016-05-17, 1 file, -4/+21)
|     pp-i-modulo code currently detects a glibc bug at runtime, at the 1st exec of each I_MODULO op. This is suboptimal; the bug should be detectable early, and PL_ppaddr[I_MODULO] updated just once, before any optrees are built.
|
|     Then, because we avoid the need to fixup I_MODULO ops in already built optrees, we can drop the !PERL_DEBUG_READONLY_OPS limitation on the alternative/workaround I_MODULO implementation that avoids the bug.
|
|     perl.c:
|       bug detection code is copied from PP(i_modulo), into S_fixup_platform_bugs(), and called from perl_construct(). It patches Perl_pp_i_modulo_1() into PL_ppaddr[I_MODULO] when needed.
|
|     pp.c:
|       PP(i_modulo_0), the original implementation, is renamed to PP(i_modulo)
|       PP(i_modulo_1), the bug-fix workaround, is renamed _glibc_bugfix; it is #ifdefd as before, but dropping !PERL_DEBUG_READONLY_OPS
|       PP(i_modulo) - the 1st-exec switcher code, is dropped
|
|     opcode.pl:
|       Two i_modulo entries are added to @raw_alias.
|       - 1st alias: Perl_pp_i_modulo => 'i_modulo'
|       - 2nd alt: Perl_pp_i_modulo_glibc_bugfix => 'i_modulo'
|       1st is a restatement of the default alias/mapping that would be created without the line. 2nd line is then seen as alternative to the explicit mapping set by 1st.
|
|       Alternative functions are written to pp_proto.h after the standard Perl_pp_* list, and include #if-cond, #endif wrappings, as was specified by 2nd @raw_alias addition.
|
|     Changes tested by inserting '1 ||' into the 3 ifdefs and bug-detection code.
|
|     TODO:
|       In pp_proto.h generation, the #ifdef wrapping code which handles the alternative functions looks like it should also be used for the non-alternate functions. In particular, there are a handful of pp-function prototypes that should be wrapped with #ifdef HAS_SOCKET. That said, there have been no problem reports, so I left it alone.
|
|     TonyC: make S_fixup_platform_bugs static, porting/libperl.t was failing.
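The "detect once at start-up, then patch the dispatch table" idea can be sketched in C like this. It is an illustration only, assuming a generic function-pointer table: `op_table`, `platform_modulo_is_buggy`, `fixup_platform_bugs` and both modulo functions are hypothetical names, not the contents of perl.c or pp.c, and the probe values are made up for the sketch.

```c
typedef int (*binop_fn)(int, int);

enum { OP_I_MODULO_SKETCH, OP_COUNT_SKETCH };

/* Default implementation: plain C remainder. */
static int i_modulo_plain(int l, int r) { return l % r; }

/* Workaround variant: never hands a negative right operand to '%'.
 * On a conforming platform this gives the same result as the plain
 * version, because the sign of the result follows the left operand. */
static int i_modulo_workaround(int l, int r)
{
    return l % (r < 0 ? -r : r);
}

/* Dispatch table consulted every time the op runs. */
static binop_fn op_table[OP_COUNT_SKETCH] = { i_modulo_plain };

/* Hypothetical probe: nonzero when a known case comes out wrong.
 * 'volatile' keeps the compiler from folding the test at build time. */
static int platform_modulo_is_buggy(void)
{
    volatile int l = 3, r = -5;
    return (l % r) != 3;
}

/* Called once, before any op is executed, so no already-built
 * structures ever need to be patched afterwards. */
static void fixup_platform_bugs(void)
{
    if (platform_modulo_is_buggy())
        op_table[OP_I_MODULO_SKETCH] = i_modulo_workaround;
}
```

After `fixup_platform_bugs()` has run, callers go through `op_table[OP_I_MODULO_SKETCH]` and never pay for a per-execution check.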
* embed.fnc: Alter 'b' flag meaning (Karl Williamson, 2016-05-12, 1 file, -2/+6)
|     This commit changes this flag to mean that the backward compatibility functions are compiled unless the -DNO_MATHOMS cflag is specified to Configure. Previously the meaning was sort of like that but not precisely. Doing this means that the prototypes that needed to be manually added to mathoms.c are no longer needed. No special parameter assertions have to be made. makedef.pl no longer needs to parse mathoms.c and have special cases for it. And several special case entries in embed.fnc can be non-special cased.
* regen/embed.pl: Don't: #define FOO FOO (Karl Williamson, 2016-05-12, 1 file, -1/+3)
|     Doing so would be useless. This doesn't currently happen, but would in a couple of commits.
* embed.fnc: Change 'b' flag to not imply 'p' flag (Karl Williamson, 2016-05-12, 1 file, -1/+1)
|     By doing this, we make it more general, which will be useful in a few commits.
* regen/embed.pl: Verify flags field of embed.fnc (Karl Williamson, 2016-05-12, 1 file, -0/+3)
|     Make sure that the specified flags are legal.
* feature.pm: add the v5.25 bundle [tag: v5.25.0] (Ricardo Signes, 2016-05-09, 1 file, -1/+3)
|
* silence -Wparentheses-equality (David Mitchell, 2016-03-28, 1 file, -2/+2)
|     Clang has taken it upon itself to warn when an equality is wrapped in double parentheses, e.g.
|
|         ((foo == bar))
|
|     Which is a bit dumb, as any code along the lines of
|
|         #define isBAR (foo == BAR)
|         if (isBAR) {}
|
|     will trigger the warning.
|
|     This commit shuts clang up by putting in a harmless cast:
|
|         #define isBAR cBOOL(foo == BAR)
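A compilable version of the pattern above might look like the following C sketch. `cBOOL_SKETCH` is a hypothetical stand-in written for this example, not the core's own cBOOL definition, and `BAR`, `isBAR_CAST` and `count_bars` are likewise made up.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a boolean-normalising cast macro. */
#define cBOOL_SKETCH(cond) ((cond) ? (bool)1 : (bool)0)

enum { BAR = 42 };

/* The macro body is a comparison wrapped in the cast, so when it is
 * used as an 'if' condition the condition is no longer a bare
 * doubly-parenthesised equality. */
#define isBAR_CAST(foo) cBOOL_SKETCH((foo) == BAR)

/* Counts how many elements of v equal BAR. */
static int count_bars(const int *v, int n)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (isBAR_CAST(v[i]))
            hits++;
    return hits;
}
```

The cast does not change the truth value of the test; it only changes the shape of the expression the compiler sees.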
* regen/mk_invlists.pl: Revamp so works on earlier Unicodes (Karl Williamson, 2016-03-17, 1 file, -336/+596)
|     The code that generates the tables for the \b{foo} handling (in regexec.c) did not correctly work when compiled on an earlier Unicode. This fixes things up to do that, consolidating some common code into a common function and making the generated hdr file look nice, with the tables taking fewer columns of screen space.
* bump $strict::VERSION and $warnings::VERSION (Tony Cook, 2016-03-02, 1 file, -1/+1)
|
* narrow the filename check in strict.pm/warnings.pm (Aristotle Pagaltzis, 2016-03-02, 1 file, -4/+5)
|     • The code previously assumed that any filename basename besides `strict.pm` meant that the user mistyped `use strict` (e.g. as `use Strict`). But that could just mean the file was not loaded from the filesystem, e.g. due to naïve fatpacking. This is fixed by adding a guard to check that an unexpected value really is a mis-capitalised variant of `strict.pm`.
|
|     • The code previously insisted on either slash or backslash as the directory separator, which is not strictly portable (though nobody noticed in years; apparently nobody has tried to run a recent-ish perl on a MacOS Classic or RiscOS system). This is fixed by switching to \b as a best effort, to avoid going down the rabbit hole of platform-specific separators.
|
|     • The code previously used an `unless` statement, declared lexical variables inside its block, and used ${\EXPR} to interpolate the __PACKAGE__ constant into the regexp. Each of these increases the size of the optree, which is only ever executed once, then sticks around wasting some hundred(s) bytes in almost every single Perl program in the world. This is fixed for warnings.pm by rewriting the code with no use of any temporary variables and single-quoted strings instead of regexp literals. In strict.pm, we can do even better by moving the code to the BEGIN block, since BEGIN CVs are freed after running. (We do not add one to warnings.pm since BEGIN blocks have a creation cost.)
* Use table lookup for qr/\b{wb}/ (Karl Williamson, 2016-02-03, 1 file, -0/+240)
|     This follows the recent commits for lb and gcb, and generates a table at regen time for Word Breaking. The result may run faster, depending on the compiler optimization capabilities, than before, and is easier to maintain, as it's easier to smack a new rule into the regen perl script than it is to change the C code.
* split CXt_LOOP_FOR into CXt_LOOP_LIST,CXt_LOOP_ARY (David Mitchell, 2016-02-03, 1 file, -2/+3)
|     Create a new context type so that "for (1,2,3)" and "for (@ary)" are now two separate types.
|
|     For the list type, we store the index of the base stack element in the state union rather than having an array pointer. Currently this is just the same as blk_resetsp, but this will shortly allow us to eliminate the resetsp field from the struct block_loop - which is currently the largest sub-struct within the block union.
|
|     Having two separate types also allows the two cases to be handled directly in the main switch in the hot pp_iter code, rather than having extra conditionals.
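The split into two context types with a shared state union can be illustrated with a small C sketch. It assumes a simplified integer "value stack" and invented names (`loop_ctx`, `value_stack`, `loop_next`); it is not the layout of struct block_loop nor the code in pp_iter.

```c
#include <stddef.h>

/* Two distinct loop kinds instead of one kind plus extra conditionals. */
enum loop_kind { LOOP_LIST_SKETCH, LOOP_ARY_SKETCH };

struct loop_ctx {
    enum loop_kind kind;
    size_t ix;                  /* next element to visit */
    size_t end;                 /* number of elements    */
    union {
        size_t stack_base;      /* LIST: index of the base stack element */
        const int *ary;         /* ARY: pointer to the array itself      */
    } state;
};

/* Toy value stack; "for (1,2,3)" would leave its items on it. */
static int value_stack[8];

/* Fetches the next element into *out; returns 0 when the loop is done.
 * Each kind is handled directly by its own case of a single switch. */
static int loop_next(struct loop_ctx *cx, int *out)
{
    if (cx->ix >= cx->end)
        return 0;
    switch (cx->kind) {
    case LOOP_LIST_SKETCH:
        *out = value_stack[cx->state.stack_base + cx->ix];
        break;
    case LOOP_ARY_SKETCH:
        *out = cx->state.ary[cx->ix];
        break;
    }
    cx->ix++;
    return 1;
}
```

Because only one union member is live per kind, the list case needs no array pointer at all, which is what lets a field be dropped from the shared struct.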
* regen/mk_invlists.pl: add braces round subobject initialisers (Aaron Crane, 2016-01-21, 1 file, -14/+12)
|     This suppresses many clang warnings saying "suggest braces around initialization of subobject" when the generated charclass_invlists.h is included.
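For readers unfamiliar with the warning, here is a minimal C sketch of fully braced subobject initialisers. The `range_pair` type and the `letters` table are invented for illustration and are unrelated to the real contents of charclass_invlists.h.

```c
#include <stddef.h>

struct range_pair { unsigned lo; unsigned hi; };

/* Flat form that compilers may flag:
 *     static const struct range_pair letters[2] = { 0x41, 0x5A, 0x61, 0x7A };
 * Braced form: each struct element gets its own pair of braces. */
static const struct range_pair letters[2] = {
    { 0x41, 0x5A },   /* 'A'..'Z' */
    { 0x61, 0x7A },   /* 'a'..'z' */
};

/* Returns nonzero if cp falls inside any of the ranges. */
static int in_ranges(unsigned cp)
{
    for (size_t i = 0; i < sizeof letters / sizeof letters[0]; i++)
        if (cp >= letters[i].lo && cp <= letters[i].hi)
            return 1;
    return 0;
}
```

Both forms initialise the same values; the braced one simply makes the nesting explicit.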
* Use lookup table for /\b{gcb}/ instead of switch stmt (Karl Williamson, 2016-01-19, 1 file, -7/+116)
|     This changes the handling of Grapheme Cluster Breaks to be entirely via a lookup table generated by regen/mk_invlists.pl.
|
|     This is easier to maintain and follow, as the generation of the table follows the text of Unicode's UAX29 precisely, and loops can be used to set every class up instead of having to name each explicitly, so it will be easier to add new rules. And the runtime switch statement is replaced by a single line.
|
|     My gcc compiler optimized the previous version to an array lookup, but this commit does it for not so clever compilers.
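How loops can fill such a table, with the run-time check reduced to one array access, is sketched below in C. Only five classes and a handful of UAX #29 rules (GB3, GB4, GB5, GB9, GB999) are modelled, the real generator is a Perl script rather than C, and every identifier here is hypothetical.

```c
/* Hypothetical reduced class set; CR, LF and CONTROL are kept
 * contiguous so one loop can cover all three. */
enum gcb_sketch { G_OTHER, G_CR, G_LF, G_CONTROL, G_EXTEND, G_COUNT };

/* 1 = break allowed between [before][after], 0 = no break. */
static unsigned char gcb_table[G_COUNT][G_COUNT];

/* Rules are applied from lowest to highest priority, so later
 * assignments override earlier ones. */
static void build_gcb_table(void)
{
    int b, a, c;

    /* GB999: otherwise, break everywhere. */
    for (b = 0; b < G_COUNT; b++)
        for (a = 0; a < G_COUNT; a++)
            gcb_table[b][a] = 1;

    /* GB9: do not break before Extend. */
    for (b = 0; b < G_COUNT; b++)
        gcb_table[b][G_EXTEND] = 0;

    /* GB4 and GB5: break after and before Control, CR, LF. */
    for (c = G_CR; c <= G_CONTROL; c++)
        for (a = 0; a < G_COUNT; a++) {
            gcb_table[c][a] = 1;
            gcb_table[a][c] = 1;
        }

    /* GB3: do not break between CR and LF. */
    gcb_table[G_CR][G_LF] = 0;
}

/* The run-time check is a single table access. */
static int gcb_is_break(enum gcb_sketch before, enum gcb_sketch after)
{
    return gcb_table[before][after];
}
```

Adding a rule in this style means adding one more loop at the right priority, rather than touching a switch statement in the matcher.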
* Add qr/\b{lb}/ (Karl Williamson, 2016-01-19, 1 file, -1/+567)
|     This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module.
|
|     This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm.
|
|     The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those.
|
|     This should meet the needs for line breaking of many applications, without having to load the module.
|
|     The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
* Make tables for Perl-tailored Unicode Line_Break property (Karl Williamson, 2016-01-19, 1 file, -0/+38)
|     This is in preparation for adding qr/\b{lb}/. This just generates the tables, and is a separate commit because otherwise the diff listing is confusing, as it doesn't realize there are only additions. So, even though the difference listing for this commit for the generated header file is wildly crazy, the only changes in reality are the addition of some tables for Line Break.
* regen/mk_invlists.pl: Use property's real values (Karl Williamson, 2016-01-19, 1 file, -1/+8)
|     A future commit will tailor a property to use fewer values than Unicode provides. Currently we look at the official property, and croak if not all the property values are there. This commit instead looks at the tailored property, the one that actually is being output.
* regen/mk_invlists.pl: Internal housekeeping (Karl Williamson, 2016-01-19, 1 file, -2/+1)
|     This moves the name of a synthetic enum value to a better place in the code. The list it had been in is for a specific purpose that is not applicable to synthetic values, though it worked. But the new place is more logical, and can take advantage of the previous commit which makes things in this place more predictable.
* regen/mk_invlists.pl: Keep internal enum values last (Karl Williamson, 2016-01-19, 1 file, -7/+12)
|     Most Unicode properties have a finite set of possible values. Most, for example, are binary, they can be either true or false, but nothing in between. Others have more possibilities (and still others, like Name, are not restricted at all). The Word Break property, for example, can take on a restricted set of values, currently 19 in all, that indicate what type, for purposes of word breaking, the character is.
|
|     In implementing things like Word Break, Perl adds some internal-only values, like EDGE, which means matching like /^/ or /$/. By using these synthetic values, we don't need to have extra code for edge cases.
|
|     These properties are implemented using C enums. Prior to this commit, the actual numeric value for each enum was mostly arbitrary, with the synthetic ones intermixed with the official ones. This commit changes that so the synthetic ones are all higher numbers than any official ones, and the order they appear in the generating code will be the numerical order they have, so that the program has control of their order.
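The ordering convention can be shown with a tiny C enum. The value names below are illustrative only (three "official" values instead of the real 19) and `is_synthetic` is a hypothetical helper, not something the generated header is claimed to contain.

```c
/* Official property values come first, in generator order; synthetic,
 * internal-only values are numbered after all of them. */
enum wb_sketch {
    WB_Other,
    WB_ALetter,
    WB_Numeric,
    WB_OFFICIAL_COUNT,              /* marker: number of official values */
    WB_EDGE = WB_OFFICIAL_COUNT,    /* synthetic: start or end of string */
    WB_TOTAL                        /* size for any table indexed by class */
};

/* With synthetic values kept last, telling the two groups apart is a
 * single comparison. */
static int is_synthetic(enum wb_sketch v)
{
    return (int)v >= (int)WB_OFFICIAL_COUNT;
}
```

A side benefit of this layout is that a new official value can be inserted before the marker without renumbering by hand.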
* embed.fnc and regen: detect duplicate fn defs (David Mitchell, 2016-01-15, 1 file, -2/+23)
|     Detect if the same function is defined more than once in embed.fnc - but only outside of any #if..#endif nesting, since that might include alternative definitions of the same function.
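One way to picture the check is the C sketch below. It makes several simplifying assumptions that are not taken from the commit: each input line is either a preprocessor directive or a bare function name (the real embed.fnc format is richer), entries inside any #if nesting are ignored entirely, and `first_duplicate` is an invented name. The actual check lives in a Perl regen script.

```c
#include <string.h>

/* Returns the index of the first top-level line that repeats an
 * earlier top-level line, or -1 if there is none. Lines inside
 * #if...#endif nesting are neither reported nor compared against. */
static int first_duplicate(const char *const *lines, int n)
{
    int depth = 0;

    for (int i = 0; i < n; i++) {
        if (strncmp(lines[i], "#if", 3) == 0)    { depth++; continue; }
        if (strncmp(lines[i], "#endif", 6) == 0) { if (depth > 0) depth--; continue; }
        if (depth > 0)
            continue;                   /* conditional section: skip */

        /* Rescan earlier lines, tracking nesting the same way, and
         * compare only against names seen at depth 0. */
        int d = 0;
        for (int j = 0; j < i; j++) {
            if (strncmp(lines[j], "#if", 3) == 0)    { d++; continue; }
            if (strncmp(lines[j], "#endif", 6) == 0) { if (d > 0) d--; continue; }
            if (d == 0 && strcmp(lines[j], lines[i]) == 0)
                return i;
        }
    }
    return -1;
}
```

The quadratic rescan keeps the sketch free of dynamic allocation; a hash of seen names would be the natural choice for real input sizes.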
* Don't generate EBCDIC POSIX-BC tables (Karl Williamson, 2016-01-14, 1 file, -19/+19)
|     This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tar ball smaller.
|
|     See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Tailor \b{wb} for Perl (Karl Williamson, 2016-01-08, 1 file, -0/+1)
|     The Unicode \b{wb} matches the boundary between space characters in a span of them. This is opposite of what \b does, and is counterintuitive to Perl expectations. This commit tailors \b{wb} to not split up spans of white space.
|
|     I have submitted a request to Unicode to re-examine their algorithm, and this has been assigned to a subcommittee to look at, but the result won't be available until after 5.24 is done. In any event, Unicode encourages tailoring for local conditions.
* Deparse.pm lives in lib/B now, not dist/B-Deparse (Lukas Mai, 2016-01-08, 1 file, -17/+12)
|
* Skip casing for high code points (Karl Williamson, 2015-12-09, 1 file, -0/+16)
|     As discussed in the previous commit, most code points in Unicode don't change if upper-, or lower-cased, etc. In fact as of Unicode v8.0, 93% of the available code points are above the highest one that does change.
|
|     This commit skips trying to case these 93%. A regen/ script keeps track of the max changing one in the current Unicode release, and skips casing for the higher ones. Thus currently, casing emoji will be skipped.
|
|     Together with the previous commits that dealt with casing, the potential for huge memory requirements for the swash hashes for casing are severely limited.
|
|     If the following command is run on a perl compiled with -O2 and no DEBUGGING:
|
|         blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after
|
|     and the file 'plane1_case_perf' contains
|
|         [
|             'string::casing::emoji' => {
|                 desc  => 'yes swash vs no swash',
|                 setup => 'my $a = "\x{1F570}"',   # MANTELPIECE CLOCK
|                 code  => 'uc($a)'
|             },
|         ];
|
|     the following results are obtained:
|
|     The numbers represent raw counts per loop iteration.
|
|     string::casing::emoji
|     yes swash vs no swash
|
|                before_this_commit    after
|                ------------------ --------
|             Ir              981.0    306.0
|             Dr              228.0     94.0
|             Dw              100.0     45.0
|           COND              137.0     49.0
|            IND                7.0      4.0
|
|         COND_m                5.5      0.0
|          IND_m                4.0      2.0
|
|          Ir_m1                0.1     -0.1
|          Dr_m1                0.0      0.0
|          Dw_m1                0.0      0.0
|
|          Ir_mm                0.0      0.0
|          Dr_mm                0.0      0.0
|          Dw_mm                0.0      0.0
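The shortcut amounts to one comparison placed in front of the expensive lookup, as in this C sketch. Everything in it is a toy: the mapping covers ASCII letters only, `MAX_CHANGING_CP_SKETCH` is a placeholder rather than the real Unicode value tracked by the regen script, and `slow_upper` merely stands in for the swash-based path.

```c
#include <stdint.h>

/* Placeholder: highest code point that changes in this toy mapping
 * ('z'). The real value is derived from the Unicode data at regen time. */
#define MAX_CHANGING_CP_SKETCH 0x7Au

/* Counts how often the expensive path is taken. */
static unsigned slow_lookups;

/* Stand-in for the expensive, table-driven case mapping. */
static uint32_t slow_upper(uint32_t cp)
{
    slow_lookups++;
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;
}

/* Anything above the highest changing code point is returned as-is
 * without consulting the mapping at all. */
static uint32_t to_upper_sketch(uint32_t cp)
{
    if (cp > MAX_CHANGING_CP_SKETCH)
        return cp;
    return slow_upper(cp);
}
```

With the guard in place, a call such as `to_upper_sketch(0x1F570)` never reaches `slow_upper`, which mirrors the drop in instruction counts reported in the table above.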
* Extend UTF-EBCDIC to handle up to 2**64-1 (Karl Williamson, 2015-11-25, 2 files, -4/+15)
|     This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word.
|
|     The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way.
|
|     This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5.
|
|     This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both.
|
|     To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.
|
|     Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to. (I pushed that commit prematurely.)
* regen/ebcdic.pl: Output tables in hex (Karl Williamson, 2015-11-25, 1 file, -8/+12)
|     When dealing with code points, it is easier to use the hex values. This outputs the tables in hex, squeezing them so they barely fit in an 80 column window. That they didn't use to so fit was why they were not output in hex prior to this commit.
|
|     The UTF8SKIP table continues to be output in decimal, as the values aren't code points.
* split pp_postdec() from pp_postinc() and improve (David Mitchell, 2015-11-10, 1 file, -1/+2)
|     pp_postinc() handles both $x++ and $x-- (and the integer variants pp_i_postinc/dec). Split it into two separate functions, as handling both inc and dec in the same function requires 3 extra conditionals.
|
|     At the same time make the code more efficient.
|
|     As currently written it:
|     1) checked for "bad" SVs (such as read-only) and croaked;
|     2) did a sv_setsv(TARG, TOPs) to return a copy of the original value;
|     3) checked for a IOK-only SV and if so, directly incremented the IVX slot;
|     4) else called out to sv_inc/dec() to handle the more complex cases.
|
|     This commit combines the checks in (1) and (3) into one single big check of flags, and for the simple integer case, skips (2) and does a more efficient SETi() instead.
|
|     For the non-simple case, both pp_postinc() and pp_postdec() now call a common static function to handle everything else.
|
|     Porting/bench.pl shows the following raw numbers for '$y = $x++' ($x and $y lexical and holding integers):
|
|                before  after
|                ------  -----
|             Ir  306.0  223.0
|             Dr  106.0   82.0
|             Dw   51.0   44.0
|           COND   48.0   33.0
|            IND    8.0    6.0
|
|         COND_m    1.9    0.0
|          IND_m    4.0    4.0
* split pp_predec() from pp_preinc() and improve (David Mitchell, 2015-11-10, 1 file, -1/+2)
|     pp_preinc() handles both ++$x and --$x (and the integer variants pp_i_preinc/dec). Split it into two separate functions, as handling both inc and dec in the same function requires 3 extra conditionals.
|
|     At the same time make the code more efficient.
|
|     As currently written it:
|     1) checked for "bad" SVs (such as read-only) and croaked;
|     2) checked for a IOK-only SV and directly incremented the IVX slot;
|     3) else called out to sv_inc() to handle the more complex cases.
|
|     This commit combines the checks in (1) and (2) into one single big check of flags, and anything "bad" simply skips the IOK-only code and calls sv_dec(), which can do its own checking of read-only etc and croak if necessary.
|
|     Porting/bench.pl shows the following raw numbers for ++$x ($x lexical and holding an integer):
|
|                  before    after
|                -------- --------
|             Ir     77.0     56.0
|             Dr     30.0     24.0
|             Dw     10.0     10.0
|           COND     12.0      9.0
|            IND      2.0      2.0
|
|         COND_m     -0.1      0.0
|          IND_m      2.0      2.0
|
|     Even having split the function into two, the combined size of the two new functions is smaller than the single previous function.
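The "one big check of flags" idea in the two commits above can be sketched in C as follows. The flag names, `sv_sketch`, `slow_inc` and `pre_inc` are all hypothetical, and the slow path here only understands integers, so this shows the shape of the optimisation rather than the real pp functions.

```c
#include <limits.h>

/* Hypothetical flag bits for a scalar value. */
enum { F_IOK = 1, F_NOK = 2, F_POK = 4, F_READONLY = 8, F_MAGIC = 16 };

struct sv_sketch { unsigned flags; long iv; };

/* Counts trips through the general-purpose path. */
static int slow_calls;

/* Stand-in for the general routine: does its own read-only check.
 * Returns 0 on success, -1 where real code would raise an error or
 * switch representation. */
static int slow_inc(struct sv_sketch *sv)
{
    slow_calls++;
    if (sv->flags & F_READONLY)
        return -1;
    if (sv->iv == LONG_MAX)
        return -1;
    sv->iv++;
    return 0;
}

/* Fast path: a single mask-and-compare proves the value is a plain
 * integer with nothing "bad" about it; everything else is delegated. */
static int pre_inc(struct sv_sketch *sv)
{
    const unsigned mask = F_IOK | F_NOK | F_POK | F_READONLY | F_MAGIC;

    if ((sv->flags & mask) == F_IOK && sv->iv != LONG_MAX) {
        sv->iv++;
        return 0;
    }
    return slow_inc(sv);
}
```

Because the mask includes the "bad" bits, the fast path needs no separate read-only test, and a dedicated decrement function would differ only in the arithmetic and the overflow bound.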
* [perl #126051] make the warnings::enabled example use warnings::enabled (Tony Cook, 2015-10-12, 1 file, -2/+4)
|     7e6d00f88633 added the warnif() function and changed most uses of warnings::enabled() to use warnif(), including this one. Revert just that part.
* sync regen/warnings.pl and warnings.pm $VERSION (Tony Cook, 2015-10-12, 1 file, -2/+9)
|     regen/warnings.pl's $VERSION was at 1.04 despite it being modified each time warnings.pm is modified. So make them use the same version number.
* Cleanup, document, and restructure regen/regcomp.pl (Yves Orton, 2015-10-05, 1 file, -282/+478)
|     We clean up the parsing code, replacing our set of arrays of properties with an array of hashes of properties, with utility subs registering new items, etc.
|
|     We also split up the output code into a set of subs, one sub per output "blob" (generally a var definition), so that we have some visibility of the higher level structure of our output code. With this patch visibility of the structure of what we generate emerges from the nest of here docs. :-)
|
|     Note this change does not (greatly) alter regcomp.sym or perldebguts.pod, it merely cleans up and generally speaking modernizes and most importantly documents the code.