path: root/regen
Commit message | Author | Age | Files | Lines
* Use table lookup for qr/\b{wb}/ | Karl Williamson | 2016-02-03 | 1 | -0/+240
  This follows the recent commits for lb and gcb, and generates a table at regen time for Word Breaking. The result may run faster, depending on the compiler optimization capabilities, than before, and is easier to maintain, as it's easier to smack a new rule into the regen perl script than it is to change the C code.
* split CXt_LOOP_FOR into CXt_LOOP_LIST,CXt_LOOP_ARY | David Mitchell | 2016-02-03 | 1 | -2/+3
  Create a new context type so that "for (1,2,3)" and "for (@ary)" are now two separate types.

  For the list type, we store the index of the base stack element in the state union rather than having an array pointer. Currently this is just the same as blk_resetsp, but this will shortly allow us to eliminate the resetsp field from the struct block_loop - which is currently the largest sub-struct within the block union.

  Having two separate types also allows the two cases to be handled directly in the main switch in the hot pp_iter code, rather than having extra conditionals.
* regen/mk_invlists.pl: add braces round subobject initialisers | Aaron Crane | 2016-01-21 | 1 | -14/+12
  This suppresses many clang warnings saying "suggest braces around initialization of subobject" when the generated charclass_invlists.h is included.
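A minimal sketch of the kind of initialiser this warning is about. The struct and table names below are hypothetical and are not taken from charclass_invlists.h; the point is only that a nested array (the "subobject") gets its own pair of braces, which is what clang's -Wmissing-braces diagnostic asks for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical aggregate with a nested array, standing in for one
 * entry of a generated table. */
struct range_entry {
    const char *name;
    unsigned bounds[2];     /* the "subobject" */
};

/* Without the inner braces the initialiser is still valid C, e.g.
 *     { "digit", 0x30, 0x39 }
 * but clang then suggests braces around the subobject. */
static const struct range_entry entries[] = {
    { "digit", { 0x30, 0x39 } },
    { "upper", { 0x41, 0x5A } },
};

/* Number of entries in the table. */
static size_t entry_count(void)
{
    return sizeof entries / sizeof entries[0];
}
```

Both spellings produce the same object; the braces only make the nesting explicit to the compiler and to the reader.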
* Use lookup table for /\b{gcb}/ instead of switch stmt | Karl Williamson | 2016-01-19 | 1 | -7/+116
  This changes the handling of Grapheme Cluster Breaks to be entirely via a lookup table generated by regen/mk_invlists.pl.

  This is easier to maintain and follow, as the generation of the table follows the text of Unicode's UAX29 precisely, and loops can be used to set every class up instead of having to name each explicitly, so it will be easier to add new rules. And the runtime switch statement is replaced by a single line.

  My gcc compiler optimized the previous version to an array lookup, but this commit does it for not-so-clever compilers.
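The idea of replacing a switch with a two-dimensional table can be sketched as follows. This is an illustration under stated assumptions, not the generated table: the enum is a heavily reduced, hypothetical subset of the Grapheme_Cluster_Break classes, and the entries encode only UAX29 rules GB3 (no break between CR and LF), GB4/GB5 (break around controls), GB9 (no break before Extend) and the default "break everywhere else".

```c
#include <assert.h>

/* Hypothetical, reduced set of break classes. */
typedef enum {
    GCB_Other,
    GCB_CR,
    GCB_LF,
    GCB_Control,
    GCB_Extend,
    GCB_COUNT
} gcb_class;

/* Rows: class before the boundary; columns: class after it.
 * 1 means a break is allowed, 0 means it is not. */
static const unsigned char gcb_break_table[GCB_COUNT][GCB_COUNT] = {
    /*              Other CR LF Control Extend */
    /* Other   */ { 1,    1, 1, 1,      0 },
    /* CR      */ { 1,    1, 0, 1,      1 },
    /* LF      */ { 1,    1, 1, 1,      1 },
    /* Control */ { 1,    1, 1, 1,      1 },
    /* Extend  */ { 1,    1, 1, 1,      0 },
};

/* The run-time decision is a single array lookup. */
static int is_gcb_break(gcb_class before, gcb_class after)
{
    assert(before < GCB_COUNT && after < GCB_COUNT);
    return gcb_break_table[before][after];
}
```

Adding a rule then means changing table entries (or the script that emits them) rather than adding a case to a switch.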
* Add qr/\b{lb}/ | Karl Williamson | 2016-01-19 | 1 | -1/+567
  This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module.

  This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm.

  The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those.

  This should meet the needs for line breaking of many applications, without having to load the module.

  The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
* Make tables for Perl-tailored Unicode Line_Break property | Karl Williamson | 2016-01-19 | 1 | -0/+38
  This is in preparation for adding qr/\b{lb}/. This just generates the tables, and is a separate commit because otherwise the diff listing is confusing, as it doesn't realize there are only additions. So, even though the difference listing for this commit for the generated header file is wildly crazy, the only changes in reality are the addition of some tables for Line Break.
* regen/mk_invlists.pl: Use property's real values | Karl Williamson | 2016-01-19 | 1 | -1/+8
  A future commit will tailor a property to use fewer values than Unicode provides. Currently we look at the official property, and croak if not all the property values are there. This commit instead looks at the tailored property, the one that actually is being output.
* regen/mk_invlists.pl: Internal housekeeping | Karl Williamson | 2016-01-19 | 1 | -2/+1
  This moves the name of a synthetic enum value to a better place in the code. The list it had been in is for a specific purpose that is not applicable to synthetic values, though it worked. But the new place is more logical, and can take advantage of the previous commit which makes things in this place more predictable.
* regen/mk_invlists.pl: Keep internal enum values last | Karl Williamson | 2016-01-19 | 1 | -7/+12
  Most Unicode properties have a finite set of possible values. Most, for example, are binary: they can be either true or false, but nothing in between. Others have more possibilities (and still others, like Name, are not restricted at all). The Word Break property, for example, can take on a restricted set of values, currently 19 in all, that indicate what type, for purposes of word breaking, the character is.

  In implementing things like Word Break, Perl adds some internal-only values, like EDGE, which means matching like /^/ or /$/. By using these synthetic values, we don't need to have extra code for edge cases.

  These properties are implemented using C enums. Prior to this commit, the actual numeric values for each enum were mostly arbitrary, with the synthetic ones intermixed with the official ones. This commit changes that so the synthetic ones are all higher numbers than any official ones, and the order they appear in the generating code will be the numerical order they have, so that the program has control of their order.
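One way to picture the resulting layout is the sketch below. The enum is hypothetical and heavily abbreviated (the real one is generated and has many more official values); only EDGE is named in the commit message, and the sentinel names are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical, abbreviated Word_Break enum: official Unicode values
 * come first, synthetic perl-internal values come last. */
typedef enum {
    WB_Other,
    WB_ALetter,
    WB_Numeric,
    WB_Extend,
    WB_OFFICIAL_COUNT,              /* one past the last official value */
    WB_EDGE = WB_OFFICIAL_COUNT,    /* synthetic: start or end of input */
    WB_ENUM_COUNT                   /* one past the last value of any kind */
} wb_type;

/* Because synthetic values sort after every official one, telling
 * them apart is a simple range comparison. */
static int wb_is_synthetic(wb_type t)
{
    return t >= WB_OFFICIAL_COUNT && t < WB_ENUM_COUNT;
}
```

With this ordering, a lookup table indexed by the enum keeps all official rows contiguous and all synthetic rows at the end.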
* embed.fnc and regen: detect duplicate fn defs | David Mitchell | 2016-01-15 | 1 | -2/+23
  Detect if the same function is defined more than once in embed.fnc - but only outside of any #if..#endif nesting, since that might include alternative definitions of the same function.
* Don't generate EBCDIC POSIX-BC tables | Karl Williamson | 2016-01-14 | 1 | -19/+19
  This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tarball smaller.

  See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Tailor \b{wb} for Perl | Karl Williamson | 2016-01-08 | 1 | -0/+1
  The Unicode \b{wb} matches the boundary between space characters in a span of them. This is opposite of what \b does, and is counterintuitive to Perl expectations. This commit tailors \b{wb} to not split up spans of white space.

  I have submitted a request to Unicode to re-examine their algorithm, and this has been assigned to a subcommittee to look at, but the result won't be available until after 5.24 is done. In any event, Unicode encourages tailoring for local conditions.
* Deparse.pm lives in lib/B now, not dist/B-Deparse | Lukas Mai | 2016-01-08 | 1 | -17/+12
* Skip casing for high code points | Karl Williamson | 2015-12-09 | 1 | -0/+16
  As discussed in the previous commit, most code points in Unicode don't change if upper-, or lower-cased, etc. In fact as of Unicode v8.0, 93% of the available code points are above the highest one that does change.

  This commit skips trying to case these 93%. A regen/ script keeps track of the max changing one in the current Unicode release, and skips casing for the higher ones. Thus currently, casing emoji will be skipped.

  Together with the previous commits that dealt with casing, the potential for huge memory requirements for the swash hashes for casing is severely limited.

  If the following command is run on a perl compiled with -O2 and no DEBUGGING:

      blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

  and the file 'plane1_case_perf' contains

      [
          'string::casing::emoji' => {
              desc  => 'yes swash vs no swash',
              setup => 'my $a = "\x{1F570}"',   # MANTELPIECE CLOCK
              code  => 'uc($a)'
          },
      ];

  the following results are obtained:

      The numbers represent raw counts per loop iteration.

      string::casing::emoji
      yes swash vs no swash

             before_this_commit    after
             ------------------ --------
          Ir              981.0    306.0
          Dr              228.0     94.0
          Dw              100.0     45.0
        COND              137.0     49.0
         IND                7.0      4.0

      COND_m                5.5      0.0
       IND_m                4.0      2.0

       Ir_m1                0.1     -0.1
       Dr_m1                0.0      0.0
       Dw_m1                0.0      0.0

       Ir_mm                0.0      0.0
       Dr_mm                0.0      0.0
       Dw_mm                0.0      0.0
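The short-circuit described here can be sketched as a single comparison in front of the expensive lookup. Everything named below is an assumption for illustration: the threshold constant stands in for the value the regen/ script computes (the figure used here is not taken from the commit), and slow_uppercase() is a toy replacement for the swash-based path that only knows ASCII.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed placeholder for the highest code point whose case can
 * change; in perl the real value is generated per Unicode release. */
#define MAX_CASE_CHANGING_CP 0x118DFu

/* Counts how often the expensive path is taken (sketch only). */
static unsigned slow_lookups = 0;

/* Toy stand-in for the table/swash lookup: handles ASCII only. */
static uint32_t slow_uppercase(uint32_t cp)
{
    slow_lookups++;
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;
}

/* Code points above the threshold are returned unchanged without
 * touching the expensive lookup at all. */
static uint32_t to_upper_cp(uint32_t cp)
{
    if (cp > MAX_CASE_CHANGING_CP)
        return cp;
    return slow_uppercase(cp);
}
```

For U+1F570, the code point used in the benchmark above, to_upper_cp() returns immediately and slow_lookups stays at zero.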
* Extend UTF-EBCDIC to handle up to 2**64-1 | Karl Williamson | 2015-11-25 | 2 | -4/+15
  This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word.

  The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way.

  This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5.

  This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both.

  To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.

  Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to. commit I prematurely pushed that
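The byte strings quoted above are consistent with a simple layout: a 0xFF start byte followed by 13 continuation bytes, each carrying 5 payload bits under a 0xA0 marker. The sketch below reproduces only that extended 14-byte I8 form and makes no attempt at the shorter forms or at the final I8-to-EBCDIC byte mapping; the function names are hypothetical and this is not perl's own implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define I8_EXTENDED_LEN 14

/* Write the 14-byte extended I8 form of cp into out[].
 * Intended only for code points of 2**30 and above. */
static void i8_encode_extended(uint64_t cp, unsigned char out[I8_EXTENDED_LEN])
{
    int i;

    out[0] = 0xFF;                              /* start byte */
    for (i = I8_EXTENDED_LEN - 1; i >= 1; i--) {
        /* low 5 bits go into the last continuation byte first */
        out[i] = (unsigned char)(0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
}

/* Convenience check: does cp encode to the expected 14 bytes? */
static int i8_extended_matches(uint64_t cp, const char *expected)
{
    unsigned char buf[I8_EXTENDED_LEN];

    i8_encode_extended(cp, buf);
    return memcmp(buf, expected, I8_EXTENDED_LEN) == 0;
}
```

Thirteen continuation bytes give 65 payload bits, which is why a full 64-bit value fits and why the first continuation byte for 2**64-1 is 0xAF rather than 0xBF.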
* regen/ebcdic.pl: Output tables in hex | Karl Williamson | 2015-11-25 | 1 | -8/+12
  When dealing with code points, it is easier to use the hex values. This outputs the tables in hex, squeezing them so they barely fit in an 80 column window. That they previously did not fit is why they were not output in hex prior to this commit.

  The UTF8SKIP table continues to be output in decimal, as the values aren't code points.
* split pp_postdec() from pp_postinc() and improve | David Mitchell | 2015-11-10 | 1 | -1/+2
  pp_postinc() handles both $x++ and $x-- (and the integer variants pp_i_postinc/dec). Split it into two separate functions, as handling both inc and dec in the same function requires 3 extra conditionals.

  At the same time make the code more efficient. As currently written it:

  1) checked for "bad" SVs (such as read-only) and croaked;
  2) did a sv_setsv(TARG, TOPs) to return a copy of the original value;
  3) checked for an IOK-only SV and if so, directly incremented the IVX slot;
  4) else called out to sv_inc/dec() to handle the more complex cases.

  This commit combines the checks in (1) and (3) into one single big check of flags, and for the simple integer case, skips 2) and does a more efficient SETi() instead.

  For the non-simple case, both pp_postinc() and pp_postdec() now call a common static function to handle everything else.

  Porting/bench.pl shows the following raw numbers for '$y = $x++' ($x and $y lexical and holding integers):

             before    after
             ------    -----
          Ir  306.0    223.0
          Dr  106.0     82.0
          Dw   51.0     44.0
        COND   48.0     33.0
         IND    8.0      6.0

      COND_m    1.9      0.0
       IND_m    4.0      4.0
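The "one single big check of flags" can be illustrated with a toy value type. This is a sketch under stated assumptions, not perl's code: the flag bits, the sketch_sv struct and both function names are hypothetical, and the slow path merely refuses read-only values where the real code would croak.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical flag bits; these are not perl's real SVf_* values. */
#define SKF_IOK      0x01u   /* holds a valid integer */
#define SKF_NOK      0x02u   /* holds a valid float */
#define SKF_POK      0x04u   /* holds a valid string */
#define SKF_READONLY 0x08u   /* must not be modified */
#define SKF_COMPLEX  (SKF_NOK | SKF_POK | SKF_READONLY)

typedef struct {
    unsigned flags;
    long     iv;
} sketch_sv;

/* Counts how often the shared fallback is used (sketch only). */
static unsigned slow_calls = 0;

/* Shared fallback for everything that is not a plain integer.
 * Returns 0 where the real code would croak. */
static int post_increment_slow(sketch_sv *sv, long *old)
{
    slow_calls++;
    if (sv->flags & SKF_READONLY)
        return 0;
    *old = sv->iv;
    if (sv->iv != LONG_MAX)     /* sketch: no upgrade to a float */
        sv->iv++;
    return 1;
}

/* Fast path: one masked comparison replaces separate "is it bad?"
 * and "is it integer-only?" tests. */
static int post_increment(sketch_sv *sv, long *old)
{
    if ((sv->flags & (SKF_IOK | SKF_COMPLEX)) == SKF_IOK
        && sv->iv != LONG_MAX)
    {
        *old = sv->iv;          /* return value is the original */
        sv->iv++;
        return 1;
    }
    return post_increment_slow(sv, old);
}
```

A read-only or non-integer value fails the masked comparison and drops straight into the fallback, so the common integer case pays for only one test.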
* split pp_predec() from pp_preinc() and improve | David Mitchell | 2015-11-10 | 1 | -1/+2
  pp_preinc() handles both ++$x and --$x (and the integer variants pp_i_preinc/dec). Split it into two separate functions, as handling both inc and dec in the same function requires 3 extra conditionals.

  At the same time make the code more efficient. As currently written it:

  1) checked for "bad" SVs (such as read-only) and croaked;
  2) checked for an IOK-only SV and directly incremented the IVX slot;
  3) else called out to sv_inc() to handle the more complex cases.

  This commit combines the checks in (1) and (2) into one single big check of flags, and anything "bad" simply skips the IOK-only code and calls sv_dec(), which can do its own checking of read-only etc and croak if necessary.

  Porting/bench.pl shows the following raw numbers for ++$x ($x lexical and holding an integer):

               before    after
             -------- --------
          Ir     77.0     56.0
          Dr     30.0     24.0
          Dw     10.0     10.0
        COND     12.0      9.0
         IND      2.0      2.0

      COND_m     -0.1      0.0
       IND_m      2.0      2.0

  Even having split the function into two, the combined size of the two new functions is smaller than the single previous function.
* [perl #126051] make the warnings::enabled example use warnings::enabled | Tony Cook | 2015-10-12 | 1 | -2/+4
  7e6d00f88633 added the warnif() function and changed most uses of warnings::enabled() to use warnif(), including this one. Revert just that part.
* sync regen/warnings.pl and warnings.pm $VERSION | Tony Cook | 2015-10-12 | 1 | -2/+9
  regen/warnings.pl's $VERSION was at 1.04 despite it being modified each time warnings.pm is modified. So make them use the same version number.
* Cleanup, document, and restructure regen/regcomp.pl | Yves Orton | 2015-10-05 | 1 | -282/+478
  We clean up the parsing code, replacing our set of arrays of properties with an array of hashes of properties, with utility subs registering new items, etc.

  We also split up the output code into a set of subs, one sub per output "blob" (generally a var definition), so that we have some visibility of the higher level structure of our output code. With this patch visibility of the structure of what we generate emerges from the nest of here docs. :-)

  Note this change does not (greatly) alter regcomp.sym or perldebguts.pod, it merely cleans up and generally speaking modernizes and most importantly documents the code.
* remove documentation for the now-removed lexical topic | Ricardo Signes | 2015-10-02 | 1 | -1/+1
* Match ops no longer need OPpTARGET_MY | Father Chrysostomos | 2015-09-29 | 1 | -3/+0
  Actually, I don’t think they have needed it for a while.
* Remove OPpGREP_LEX | Father Chrysostomos | 2015-09-29 | 1 | -5/+0
  It is no longer used.
* Bump $warnings::VERSION to 1.34 | Father Chrysostomos | 2015-09-29 | 1 | -1/+1
* Remove experimental::lexical_topic warnings category | Father Chrysostomos | 2015-09-29 | 1 | -2/+0
* Change EBCDIC macro definition | Karl Williamson | 2015-09-04 | 1 | -0/+4
  This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC platforms to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* l1_char_class_tab.h: Add comments | Karl Williamson | 2015-09-04 | 1 | -0/+1
  This adds the I8 value (used for generating UTF-EBCDIC) for bytes where it differs from the regular value on the EBCDIC portions of this header. This value is useful in debugging.
* l1_char_class_tab.h: Add bits for UTF-EBCDIC | Karl Williamson | 2015-09-04 | 1 | -0/+32
  This is for the next commit.
* regen/mk_PL_charclass.pl: Refactor a print | Karl Williamson | 2015-09-04 | 1 | -3/+7
  This is in preparation for the next commits.
* Remove no longer used #define | Karl Williamson | 2015-09-04 | 1 | -1/+0
  The previous commit removed all uses of this non-public #define.
* ck_refassign: selectively copy OPpPAD_INTRO/STATE | David Mitchell | 2015-08-19 | 1 | -0/+6
  Previously this function unconditionally copied the OPpLVAL_INTRO and OPpPAD_STATE flags from the LH var op to the refassign op, even when those flag bits weren't used or meant something different.

  This commit makes the copying more selective. It also makes clear by code comments and asserts, that the refassign op uses bit 6, OPpPAD_STATE, to mean either that or OPpOUR_INTRO depending on the type of LHS.

  I couldn't think of any test that would break under the old regime, but this future-proofs the code against new flags and meanings.
* re-implement OPpASSIGN_COMMON mechanism | David Mitchell | 2015-08-17 | 1 | -1/+14
  This commit almost completely replaces the current mechanism for detecting and handling common vars in list assignment, e.g.

      ($a,$b) = ($b,$a);

  In general outline: it creates more false positives at compile-time than before, but also no longer misses some false negatives. In compensation, it considerably reduces the run-time cost of handling potential and real commonality.

  It does this firstly by splitting the OPpASSIGN_COMMON flag into 3 separate flags:

      OPpASSIGN_COMMON_AGG
      OPpASSIGN_COMMON_RC1
      OPpASSIGN_COMMON_SCALAR

  which indicate different classes of commonality that can be handled in different ways at runtime.

  Most importantly, it distinguishes between two basic cases. Firstly, common scalars (OPpASSIGN_COMMON_SCALAR), e.g.

      ($x,....) = (....,$x,...)

  where $x is modified and then sometime later its value is used again, but that value has changed in the meantime. In this case, we need to replace such vars on the RHS with mortal copies before processing the assign.

  The second case is an aggregate on the LHS (OPpASSIGN_COMMON_AGG), e.g.

      (...,@a) = (...., $a[0],...)

  In this case, the issue is instead that when @a is cleared, it may free items on the RHS (due to the stack not being ref counted). What is required here is that rather than making a copy of each RHS element and storing it in the array as we progress, we make *all* the copies *before* clearing the array, but mortalise them in case we die in the meantime.

  We can further distinguish two scalar cases; sometimes it's possible to confirm non-commonality at run-time merely by checking that all the LHS scalars have a reference count of 1. If this is possible, we set the OPpASSIGN_COMMON_RC1 flag rather than the OPpASSIGN_COMMON_SCALAR flag.

  The major improvement in the run-time performance in the OPpASSIGN_COMMON_SCALAR case (or OPpASSIGN_COMMON_RC1 if rc>1 scalars are detected), is to use a mark-and-sweep scan of the two lists using the SVf_BREAK flag, to determine which elements are common, and only make mortal copies of those elements. This has a very big effect on run-time performance; for example in the classic

      ($a,$b) = ($b,$a);

  it would formerly make temp copies of both $a and $b; now it only copies $a.

  In more detail, the mark and sweep mechanism in pp_aassign works by looping through each LHS and RHS SV pair in parallel. It temporarily marks each LHS SV with the SVf_BREAK flag, then makes a copy of each RHS element only if it has the SVf_BREAK flag set. When the scan is finished, the flag is unset on all LHS elements.

  One major change in compile-time flagging is that package scalar vars are now treated as if they could always be aliased. So we don't bother any more to do the compile-time PL_generation checking on package vars (we still do it on lexical vars). We also no longer make use of the run-time PL_sawalias mechanism for detecting aliased package vars (and indeed the next commit but one will remove that mechanism). This means that more list assignment expressions which feature package vars will now need to do a runtime mark-and-sweep (or where appropriate, RC1) test. In compensation, we no longer need to test for aliasing and set PL_sawalias in pp_gvsv and pp_gv, nor reset PL_sawalias in every pp_nextstate.

  Part of the reasoning behind this is that it's nearly impossible to detect all possible package var aliasing; for example PL_sawalias would fail to detect XS code doing GvSV(gv) = sv.

  Note that we now scan the two children of the OP_AASSIGN separately, and in particular we mark lexicals with PL_generation only on the LHS and test only on the RHS. So something like

      ($x,$y) = ($default, $default)

  will no longer be regarded as having common vars.

  In terms of performance, running Porting/perlbench.pl on the new expr::aassign:: tests in t/perf/benchmarks shows that the biggest slowdown is around 13% more instruction reads and 20% more conditional branches in this:

      setup => 'my ($v1,$v2,$v3) = 1..3; ($x,$y,$z) = 1..3;',
      code  => '($x,$y,$z) = ($v1,$v2,$v3)',

  where this is now a false positive due to the presence of package variables.

  The biggest speedup is 50% fewer instruction reads and conditional branches in this:

      setup => '@_ = 1..3; my ($x,$y,$z)',
      code  => '($x,$y,$z) = @_',

  because formerly the presence of @_ pessimised things if the LHS wasn't a my declaration (it's still pessimised, but the runtime's faster now). Conversely, we pessimise the 'my' variant too now:

      setup => '@_ = 1..3;',
      code  => 'my ($x,$y,$z) = @_',

  this gives 5% more instruction reads and 11% more conditional branches now. But see the next commit, which will cheat for that particular construct.
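The pairwise mark-and-sweep scan described in this message can be sketched with a toy value type. The struct, the flag constant and both function names below are hypothetical stand-ins (SKF_BREAK plays the role of SVf_BREAK, and a caller-supplied temps[] array plays the role of mortal copies); equal-length lists are assumed, which the real pp_aassign does not require.

```c
#include <assert.h>
#include <stddef.h>

#define SKF_BREAK 0x1u          /* stand-in for SVf_BREAK */

typedef struct {
    unsigned flags;
    int      value;
} sketch_sv;

/* Walk the LHS/RHS pairs in parallel: mark each LHS element, and
 * replace an RHS element with a copy only if it is already marked.
 * temps[] must have room for n copies. Returns the number of copies
 * made; all marks are cleared before returning. */
static size_t copy_common(sketch_sv **lhs, sketch_sv **rhs, size_t n,
                          sketch_sv *temps)
{
    size_t i, copies = 0;

    for (i = 0; i < n; i++) {
        lhs[i]->flags |= SKF_BREAK;
        if (rhs[i]->flags & SKF_BREAK) {
            temps[copies] = *rhs[i];
            temps[copies].flags &= ~SKF_BREAK;
            rhs[i] = &temps[copies++];
        }
    }
    for (i = 0; i < n; i++)
        lhs[i]->flags &= ~SKF_BREAK;    /* the "sweep": unmark */
    return copies;
}

/* Plain left-to-right assignment once the RHS has been made safe. */
static void assign_list(sketch_sv **lhs, sketch_sv **rhs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        lhs[i]->value = rhs[i]->value;
}
```

For the swap ($a,$b) = ($b,$a), the first pair marks $a and finds $b unmarked, and the second pair marks $b and finds $a marked, so exactly one copy (of $a) is made, matching the behaviour the message describes.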
* No __attribute__((nonnull(...))) from NN. | Jarkko Hietaniemi | 2015-08-14 | 1 | -4/+0
* regen/mk_PL_charclass.pl: Suppress extra null array element | Karl Williamson | 2015-08-01 | 1 | -0/+2
  We don't output a trailing comma after the final element in these C arrays, and thus prevent the C compiler from generating a useless null element.
* mktables: Fix up Name_Alias in early Unicodes | Karl Williamson | 2015-07-28 | 1 | -2/+3
  perl needs the Name_Alias property accessible in all releases in order for charnames to work properly. However the property was not created until Unicode version 5.0. Previously, the property was made available to all Unicode versions, which is contrary to the policy of exposing properties to public use only when Unicode so exposes them. Thus the behavior is as close as possible to Unicode-specified. This commit creates an internal-only property for the perl core, and removes the general access on early Unicode releases.
* mktables: Add handling of WB and SB for early Unicodes | Karl Williamson | 2015-07-28 | 1 | -2/+2
  This allows \b{wb} and \b{sb} to work on all Unicode releases. The huge number of differences in charclass_invlists.h is only because the names of the SB and WB tables change, and the code automatically re-alphabetizes things.
* mktables: Fix GCB to work on early Unicodes | Karl Williamson | 2015-07-28 | 1 | -5/+8
  The GCB property was not properly being generated in early Unicode releases. The huge commit diff is due solely to the fact that the name of this property changes, so that it is sure not to be accessible outside the perl core, and the property tables are automatically resorted alphabetically.
* regen/mk_invlists.pl: Handle early Unicodes CF | Karl Williamson | 2015-07-28 | 1 | -1/+2
  In very early Unicode releases, the case folding table can be in a different format.
* regen/mk_invlists.pl: White-space only | Karl Williamson | 2015-07-28 | 1 | -6/+6
  Reindent after the previous commit introduced an outer block.
* regen/mk_invlists.pl: Properly handle empty properties | Karl Williamson | 2015-07-28 | 1 | -7/+23
  This failed to adequately handle empty properties; something that wasn't seen until compiling older Unicode releases.
* regen/mk_PL_charclass.pl: Use names known in all Unicodes | Karl Williamson | 2015-07-28 | 1 | -1/+7
  This just changes, for properties that aren't defined in all Unicode versions, to use synonyms that are defined in all of them.
* regen/regcharclass.pl: Work on early Unicodes | Karl Williamson | 2015-07-28 | 1 | -3/+3
  This just changes, for properties that aren't defined in all Unicode versions, to use synonyms that are defined in all of them.
* regen/mk_PL_charclass.pl: Don't confuse simple with multi folds | Karl Williamson | 2015-07-28 | 1 | -3/+15
  On early Unicode releases, this was saying that a character had a simple fold from above Latin1, whereas it didn't. This was caused by not keeping the simple folds separate from the multi-character ones. The solution is to keep a separate data structure for the simple ones.
* regen/unicode_constants.pl: Add U+130, +131 | Karl Williamson | 2015-07-28 | 1 | -0/+2
  These will be used in the next commit.
* regen/mk_PL_charclass.pl: Add extra info to debug line | Karl Williamson | 2015-07-28 | 1 | -1/+1
  This is currently commented out, but this is helpful during the times when it is used.
* regen/regcharclass.pl: Handle empty lists | Karl Williamson | 2015-07-28 | 1 | -0/+2
  Short circuit the remaining code and return a 0 if the input doesn't match anything.
* There are no folds to multiple chars in early Unicode versions | Karl Williamson | 2015-07-28 | 1 | -0/+2
  Several places require special handling because of this, notably for the lowercase Sharp S, but not in Unicodes before 3.0.1.
* regen/unicode_constants.pl: Generate #defines giving which Unicode version | Karl Williamson | 2015-07-28 | 1 | -4/+17
  Future commits will want to take different actions depending on which Unicode version is being used.
* regen/unicode_constants.pl: Skip U+1E9E if not in Unicode version | Karl Williamson | 2015-07-28 | 1 | -1/+1
  LATIN CAPITAL LETTER SHARP S is not available in all Unicode releases; simply skip generating things when it isn't there.