path: root/embed.fnc
Commit message    Author    Age    Files    Lines
...
* Inline dfa for translating from UTF-8  (Karl Williamson, 2018-07-05; 1 file, -3/+10)

  This commit inlines the simple portion of the dfa that translates from
  UTF-8 to code points, used in functions like utf8_to_uvchr_buf.  This
  dfa has been changed in previous commits so that it is small, and punts
  on any problematic input, plus 18% of the Hangul syllable code points.
  (These still come out faster than blead.)  The smallness allows it to
  be inlined, adding <2000 total bytes to the perl text space.

  The inlined part never calls anything that needs thread context, so
  that parameter can be removed.  I decided to remove it also from the
  Perl_utf8_to_uvchr_buf() and Perl_utf8n_to_uvchr_error() functions.
  There is a small risk that someone is actually using those functions
  instead of the documented macros utf8_to_uvchr_buf() and
  utf8n_to_uvchr_error().  If so, this can be added back in.
  Perl_utf8_to_uvchr_msgs() is entirely removed, but the macro
  utf8_to_uvchr_msgs(), which is the normal interface to it, is retained
  unchanged, and it is marked as unstable anyway.

  This change decreases the number of conditional branches in the Perl
  statement my $a = ord("\x{foo}"), where foo is a non-problematic code
  point, by about 11%, except for ASCII characters, where it is 4%, and
  those Hangul syllables mentioned above, where it is 7%.

  Problematic code points fare much worse here than in blead.  These are
  the surrogates, non-characters, and non-Unicode code points.  We don't
  care very much about the speed of handling these code points, which
  are mostly considered illegal by Unicode anyway.
  The percentage decrease is higher for just the function itself, as the
  measured Perl statement has unchanged overhead.  Here are the
  annotated benchmarks:

  Key:
      Ir      Instruction read
      Dr      Data read
      Dw      Data write
      COND    conditional branches
      IND     indirect branches
      _m      branch predict miss
      _m1     level 1 cache miss
      _mm     last cache (e.g. L3) miss
      -       indeterminate percentage (e.g. 1/0)

  The numbers represent raw counts per loop iteration.

  translate_utf8_to_uv_007f
  my $a = ord("\x{007f}")

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     395.0   370.0    106.8
      Dr     122.0   115.0    106.1
      Dw      71.0    61.0    116.4
      COND    49.0    47.0    104.3
      IND      5.0     5.0    100.0

  In all the measurements, the indirect numbers were all zeros and
  unchanged, and are omitted in this message.

  translate_utf8_to_uv_07ff
  my $a = ord("\x{07ff}")

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     438.0   390.0    112.3
      Dr     128.0   118.0    108.5
      Dw      71.0    61.0    116.4
      COND    57.0    51.0    111.8
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_cfff
  my $a = ord("\x{cfff}")
  This is the highest Hangul syllable that gets the full reduction.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   410.0    111.5
      Dr     131.0   121.0    108.3
      Dw      71.0    61.0    116.4
      COND    61.0    55.0    110.9
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_d000
  my $a = ord("\x{d000}")
  This is the lowest affected Hangul syllable.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   443.0    103.2
      Dr     131.0   132.0     99.2
      Dw      71.0    71.0    100.0
      COND    61.0    57.0    107.0
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_d7ff
  my $a = ord("\x{d7ff}")
  This is the highest affected Hangul syllable.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   443.0    103.2
      Dr     131.0   132.0     99.2
      Dw      71.0    71.0    100.0
      COND    61.0    57.0    107.0
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_d800
  my $a = ord("\x{d800}")
  This is a surrogate, showing much worse performance, but we don't care.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   515.0     88.7
      Dr     131.0   134.0     97.8
      Dw      71.0    73.0     97.3
      COND    61.0    75.0     81.3
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_fdd0
  my $a = ord("\x{fdd0}")
  This is a non-char, showing much worse performance, but we don't care.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   548.0     83.4
      Dr     131.0   139.0     94.2
      Dw      71.0    73.0     97.3
      COND    61.0    81.0     75.3
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_fffd
  my $a = ord("\x{fffd}")

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   410.0    111.5
      Dr     131.0   121.0    108.3
      Dw      71.0    61.0    116.4
      COND    61.0    55.0    110.9
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_ffff
  my $a = ord("\x{ffff}")
  This is another non-char, showing much worse performance, but we
  don't care.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     457.0   548.0     83.4
      Dr     131.0   139.0     94.2
      Dw      71.0    73.0     97.3
      COND    61.0    81.0     75.3
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_1fffd
  my $a = ord("\x{1fffd}")

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     476.0   430.0    110.7
      Dr     134.0   124.0    108.1
      Dw      71.0    61.0    116.4
      COND    65.0    59.0    110.2
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_10fffd
  my $a = ord("\x{10fffd}")

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     476.0   430.0    110.7
      Dr     134.0   124.0    108.1
      Dw      71.0    61.0    116.4
      COND    65.0    59.0    110.2
      IND      5.0     5.0    100.0

  translate_utf8_to_uv_110000
  my $a = ord("\x{110000}")
  This is a non-Unicode code point, showing much worse performance, but
  we don't care.

             blead     dfa  Ratio %
             -----   -----  -------
      Ir     476.0   544.0     87.5
      Dr     134.0   137.0     97.8
      Dw      71.0    73.0     97.3
      COND    65.0    81.0     80.2
      IND      5.0     5.0    100.0
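  The general shape of such a decoder can be sketched in plain C.  This
  is an illustrative re-implementation, not perl's actual table-driven
  dfa: a small fast path that computes the code point and punts
  (returns -1 here) on malformed, overlong, surrogate, and
  above-Unicode input, the hard cases that perl defers to an
  out-of-line fallback.

  ```c
  #include <stddef.h>
  #include <stdint.h>

  /* Decode one UTF-8 character from [s, end).  On success, returns the
   * code point and sets *len to the number of bytes consumed; on any
   * problematic input, returns -1 so a caller can fall back to a full
   * (slower, diagnosing) decoder. */
  static int32_t
  utf8_decode_simple(const uint8_t *s, const uint8_t *end, size_t *len)
  {
      if (s >= end)
          return -1;
      uint8_t b = s[0];
      if (b < 0x80) {                       /* ASCII fast path */
          *len = 1;
          return b;
      }
      size_t n;
      int32_t cp;
      if      ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; }
      else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
      else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }
      else return -1;                       /* bad start byte */
      if (end - s < (ptrdiff_t)n)
          return -1;                        /* truncated sequence */
      for (size_t i = 1; i < n; i++) {
          if ((s[i] & 0xC0) != 0x80)
              return -1;                    /* bad continuation byte */
          cp = (cp << 6) | (s[i] & 0x3F);
      }
      /* punt on overlongs, surrogates, and super-Unicode code points */
      static const int32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
      if (cp < min_cp[n] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
          return -1;
      *len = n;
      return cp;
  }
  ```

  The real dfa encodes the same classification in a transition table so
  each input byte costs one table lookup rather than a chain of range
  tests, which is where the branch-count reduction comes from.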
* Make isUTF8_char() an inline function  (Karl Williamson, 2018-07-05; 1 file, -0/+2)

  It was a macro that used a trie.  This changes it to use the dfa
  constructed in previous commits.  I didn't bother with taking
  measurements.  A dfa should require fewer conditionals to be executed
  for many code points.
* Fix to compile under -DNO_LOCALE  (Karl Williamson, 2018-07-01; 1 file, -0/+2)

  Several problems with this compile option were not caught before 5.28
  was frozen.
* Remove some deprecated functions from mathoms.c  (Karl Williamson, 2018-06-28; 1 file, -28/+0)

  These have been deprecated since 5.18, and have security issues, as
  they can try to read beyond the end of the buffer.
* Create my_atof3()  (Karl Williamson, 2018-06-25; 1 file, -1/+2)

  This is like my_atof2(), but with an extra argument signifying the
  length of the input string to parse.  If that length is 0, it uses
  strlen() to determine it.  Then my_atof2() just calls my_atof3() with
  a zero final parameter.

  This commit uses the bulk of the current my_atof2() as the core of
  my_atof3().  Changes were needed, however, because it relied on
  NUL-termination in a number of places.

  This allows one to convert a string that isn't necessarily
  NUL-terminated to an NV.
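  The calling convention can be sketched in portable C.  Note this is a
  hypothetical stand-in (atof_n is not a perl name): perl's my_atof3
  parses in place, whereas standard strtod() requires a NUL-terminated
  buffer, so this sketch copies into scratch space to show only the
  length-argument semantics, including the len == 0 "use strlen" case.

  ```c
  #include <stdlib.h>
  #include <string.h>

  /* Parse a double from the first len bytes of s; len == 0 means the
   * string is NUL-terminated and its full length should be used. */
  static double
  atof_n(const char *s, size_t len)
  {
      char buf[64];
      if (len == 0)
          len = strlen(s);
      if (len >= sizeof(buf))
          len = sizeof(buf) - 1;   /* sketch only: truncate long input */
      memcpy(buf, s, len);
      buf[len] = '\0';             /* give strtod the NUL it needs */
      return strtod(buf, NULL);
  }
  ```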
* embed.fnc: Fix my_atof2() entry  (Karl Williamson, 2018-06-25; 1 file, -1/+1)

  This was using the incorrect formal parameter name.  It did not
  generate an error because the function declares a variable with the
  incorrect name, so this was actually asserting on the wrong thing.
* Use a perfect hash for Unicode property lookups  (Karl Williamson, 2018-04-20; 1 file, -1/+0)

  The previous commits in this series have been preparing to allow the
  Devel::Tokenizer::C code to be swapped out for the much smaller
  perfect hash code.
* Bring all Unicode property definitions into core  (Karl Williamson, 2018-04-20; 1 file, -0/+1)

  This commit causes the looking up of \p{} Unicode properties to be
  done without having to use the swash mechanism, with certain
  exceptions.  This will all be explained in the merge commit.

  This commit uses Devel::Tokenizer::C to generate the code that turns
  the property strings, as keywords, into numbers that can be understood
  by the computer.  This mechanism generates relatively large code.  The
  next commits will replace this with a smaller mechanism.
* Set up initial \p{} parse function.  (Karl Williamson, 2018-04-20; 1 file, -0/+4)

  This function will parse the interior of \p{} Unicode property names
  in regular expression patterns.  The design of this function will be
  to return NULL on the properties it cannot handle; otherwise it
  returns an inversion list representing the property it did find.  The
  current mechanism will be used to handle the cases where this function
  returns NULL.

  This initial state is just to have the function return NULL always, so
  the existing mechanism is always used.  A later commit will add the
  functionality in 5.28 that bypasses the existing mechanism.
* Move inversion lists to utf8.c  (Karl Williamson, 2018-04-20; 1 file, -0/+1)

  These previously were statics in perl.c.  A future commit would need
  access to these from regcomp.c.  We could create an access function in
  perl.c so that regcomp.c could access them, or we could move them to
  regcomp.c.  But doing that means they would also be statics in
  re_comp.c, and that would mean two copies.  So an access function is
  needed either way.  Their use is really unrelated to perl.c, which
  merely initializes them, so that could have an access function
  instead.  But the most logical place for their home is utf8.c, which
  is described as being for Unicode things, not just UTF-8 things.  So
  this commit moves these inversion lists to utf8.c, and creates an
  initialization function called on perl startup from perl.c.
* fix -DNO_MATHOMS build, mathomed syms were not removed from perldll.def  (Daniel Dragan, 2018-04-15; 1 file, -84/+84)

  Commit 3f1866a8f6 assumed the "A" flag means a function can't be
  mathomed.  Not true.  Many funcs were listed in embed.fnc as "A" yet
  were in mathoms.c.  This caused a missing-symbol link failure on Win32
  with -DNO_MATHOMS, since the "A" mathomed funcs were now put into
  perldll.def, while previously they were parsed out of mathoms.c by
  makedef.pl.  Revise the logic so "b" means instant removal from the
  export list on a no-mathoms build.

  The embed.fnc "b" flag adds were generated from a missing-symbol list
  from my linker; some funcs not in my build/platform config might need
  to be "b" flagged in future.  Some funcs like ASCII_TO_NEED were
  already marked "b" but were still being exported by mistake because
  they were also "A".  sv_2bool, sv_eq and sv_collxfrm also needed a "p"
  flag, or a Perl_-less symbol was declared in proto.h.  sv_2bool and
  sv_collxfrm also failed porting/args_assert.t, so add those macros to
  mathoms.c.
* Use unsigned to avoid compiler warning  (Karl Williamson, 2018-04-03; 1 file, -4/+4)

  The code points that Unicode furnishes will always be unsigned.  This
  changes to uniformly treat the ones in the constructed tables of
  Unicode properties as unsigned, avoiding possible signedness compiler
  warnings on some systems.  Spotted by Dave Mitchell.
* utf8.c: Remove unused thread context for core-only fcn  (Karl Williamson, 2018-03-31; 1 file, -1/+1)
* regcomp.c: Rmv no longer used core-only function  (Karl Williamson, 2018-03-31; 1 file, -1/+0)
* utf8.c: Rmv no longer used function  (Karl Williamson, 2018-03-31; 1 file, -1/+0)

  The previous commit completely stopped using this core-only function.
  Remove it.
* Use compiled-in C structure for inverted case folds  (Karl Williamson, 2018-03-31; 1 file, -0/+3)

  This commit changes to use the C data structures generated by the
  previous commit to compute what characters fold to a given one.  This
  is used to find out what things should match under /i.  This now
  avoids the expensive start-up cost of switching to perl's
  utf8_heavy.pl, loading a file from disk, and constructing a hash from
  it.
* regen/mk_invlists.pl: Inversion maps don't have to be IV  (Karl Williamson, 2018-03-31; 1 file, -1/+1)

  An inversion map currently is used only for Unicode-range code points,
  which can fit in an int, so don't use the space unnecessarily.
* regexec.c: Check for UTF-8 fitting  (Karl Williamson, 2018-03-31; 1 file, -1/+1)

  We've been burned before by malformed UTF-8 causing us to read outside
  the buffer bounds.  Here is a case I saw during code inspection, and
  it's easy to add the buffer end limit.
* Use charnames inversion lists  (Karl Williamson, 2018-03-31; 1 file, -1/+1)

  This commit makes the inversion lists for parsing character names
  global instead of interpreter level, so they can be initialized once
  per process, and no copies are created upon new thread instantiation.
  More importantly, this is another instance where utf8_heavy.pl no
  longer needs to be loaded, and the definition files read from disk.
* utf8.c: Change no longer used params to dummies  (Karl Williamson, 2018-03-31; 1 file, -2/+2)

  The previous commits have caused certain parameters to be ignored in
  some calls to these functions.  Change them to dummies, so if a
  mistake is made, it can be caught, and not promulgated.
* embed.fnc: Add a const to parameter  (Karl Williamson, 2018-03-29; 1 file, -1/+1)

  To match what the file declares it as.
* Move UTF-8 case changing data into core  (Karl Williamson, 2018-03-26; 1 file, -4/+6)

  Prior to this commit, if a program wanted to compute the case-change
  of a character above 0xFF, the C code would switch to perl, loading
  lib/utf8_heavy.pl, then read another file from disk, and then create a
  hash.  Future references would use the hash, but the start-up cost is
  quite large.  There are five case-change types: uc, lc, tc, fc, and
  simple fc.  Only the first encountered requires loading of
  utf8_heavy.pl, but each required switching to utf8_heavy, and reading
  the appropriate file from disk.

  This commit changes these functions to use compiled-in C data
  structures (inversion maps) to represent the data.  To look something
  up requires a binary search instead of a hash lookup.

  An individual hash lookup tends to be faster than a binary search, but
  the differences are small for small sizes.  I did some benchmarking
  some years ago (commit message 87367d5f9dc9bbf7db1a6cf87820cea76571bf1a),
  and the results were that for fewer than 512 entries, the binary
  search was just as fast as a hash, if not actually faster.  Now, I've
  done some more benchmarks on blead, using the tool benchmark.pl, which
  wasn't available back then.  The results below indicate that the
  differences are minimal up through 2047 entries, which all Unicode
  properties are well within.

  A hash, PL_foldclosures, is still constructed at runtime for the case
  of regular expression /i matching, and this could be generated at Perl
  compile time, as a further enhancement for later.  But reading a file
  from disk is no longer required to do this.

  ======================= benchmarking results =======================
  Key:
      Ir      Instruction read
      Dr      Data read
      Dw      Data write
      COND    conditional branches
      IND     indirect branches
      _m      branch predict miss
      _m1     level 1 cache miss
      _mm     last cache (e.g. L3) miss
      -       indeterminate percentage (e.g. 1/0)

  The numbers represent raw counts per loop iteration.

  "\x{10000}" =~ qr/\p{CWKCF}/"

               swash   invlist  Ratio %
               fetch    search
              ------   -------  -------
      Ir      2259.0    2264.0     99.8
      Dr       665.0     664.0    100.2
      Dw       406.0     404.0    100.5
      COND     406.0     405.0    100.2
      IND       17.0      15.0    113.3
      COND_m     8.0       8.0    100.0
      IND_m      4.0       4.0    100.0
      Ir_m1      8.9      17.0     52.4
      Dr_m1      4.5       3.4    132.4
      Dw_m1      1.9       1.2    158.3
      Ir_mm      0.0       0.0    100.0
      Dr_mm      0.0       0.0    100.0
      Dw_mm      0.0       0.0    100.0

  These were constructed by using the file whose contents are below,
  which uses the property in Unicode that currently has the largest
  number of entries in its inversion list, > 1600.  The test was run on
  blead -O2, no debugging, no threads.  Then the cut-off boundary was
  changed from 512 to 2047 for when we use a hash vs an inversion list,
  and the test run again.  This yields the difference between a hash
  fetch and an inversion list binary search.

  ===================== The benchmark file is below ===============

  no warnings 'once';
  my @benchmarks;
  push @benchmarks, 'swash' => {
      desc  => '"\x{10000}" =~ qr/\p{CWKCF}/"',
      setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a = "\x{10000}";',
      code  => '$a =~ $re;',
  };
  \@benchmarks;
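  The inversion-list lookup that replaces the hash can be sketched in a
  few lines of C (illustrative only; the names are not perl's).  An
  inversion list is a sorted array of code points in which even indices
  begin an included range and odd indices begin an excluded one, so
  membership is one binary search plus a parity check on the index
  found.

  ```c
  #include <stddef.h>
  #include <stdint.h>

  /* Return 1 if code point cp is in the set represented by the
   * n-element inversion list 'list', else 0. */
  static int
  invlist_contains(const uint32_t *list, size_t n, uint32_t cp)
  {
      size_t lo = 0, hi = n;
      while (lo < hi) {                 /* count elements <= cp */
          size_t mid = lo + (hi - lo) / 2;
          if (list[mid] <= cp)
              lo = mid + 1;
          else
              hi = mid;
      }
      if (lo == 0)
          return 0;                     /* below the first range */
      return ((lo - 1) & 1) == 0;       /* even index: inside a range */
  }
  ```

  For example, the set [A-Za-z] is the four-element list
  {0x41, 0x5B, 0x61, 0x7B}: ranges start at 0x41 and 0x61, and stop
  just before 0x5B and 0x7B.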
* PATCH: [perl #127288] I18N::Langinfo sets UTF-8 bit  (Karl Williamson, 2018-03-12; 1 file, -1/+2)

  This commit will turn UTF-8 on in the returned SV if its string is
  legal UTF-8 containing something besides ASCII, and the locale is a
  UTF-8 one.  It is based on the patch included in the ticket, but is
  generalized to handle edge cases.
* EBCDIC conditional compilation fixes  (Karl Williamson, 2018-03-05; 1 file, -0/+2)

  The recent changes fixed by this commit neglected to take into account
  EBCDIC differences.  Mostly, the algorithms apply only to ASCII
  platforms, so the EBCDIC is ifdef'd out.  In a couple of cases, the
  algorithm mostly applies, so the scope of the ifdefs is smaller.
* add Perl_init_named_cv() function  (David Mitchell, 2018-03-02; 1 file, -0/+1)

  This moves a block of code out from perly.y into its own function,
  because it will shortly be needed in more than one place.  Should be
  no functional changes.
* Remove parameter from isSCRIPT_RUN  (Karl Williamson, 2018-03-01; 1 file, -2/+1)

  Daniel Dragan pointed out that this parameter is unused (the commits
  that want it didn't get into 5.28), and is causing a table to be
  duplicated all over the place, so just remove it for now.
* locale.c: Let static fcn take a NULL argument  (Karl Williamson, 2018-02-25; 1 file, -1/+1)

  It turns out that it will be convenient in a future commit to have
  this function handle NULL input.  That also means that every call
  should use the return value of this function.
* PATCH: [perl #132900] Blead Breaks CPAN: FELIPE/Crypt-Perl  (Karl Williamson, 2018-02-22; 1 file, -5/+5)

  The root cause of this was using a 'char' where it should have been
  'U8'.  I changed the signatures so that all the related functions take
  and return U8's, and the compiler detects what should be cast to/from
  char.  The functions all deal with byte bit patterns, so unsigned is
  the appropriate declaration.
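  A minimal illustration of this bug class (not Crypt-Perl's or perl's
  actual code): the same byte viewed through plain char versus
  U8/unsigned char.  On platforms where plain char is signed, a byte
  with the high bit set becomes negative, so range tests written for
  0..255 values silently fail.

  ```c
  /* The same bit pattern 0xC3, promoted to int through each type. */
  static int
  as_plain_char(unsigned char byte)
  {
      char c = (char)byte;   /* typically -61 where char is signed */
      return (int)c;
  }

  static int
  as_u8(unsigned char byte)
  {
      return (int)byte;      /* always the 0..255 value, here 195 */
  }
  ```

  Declaring byte-oriented parameters as U8 makes the compiler flag every
  place a plain char sneaks in, which is exactly the fix this commit
  applies.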
* ‘Nonelems’ for pushing sparse array on the stack  (Father Chrysostomos, 2018-02-18; 1 file, -0/+2)

  To avoid having to create deferred elements every time a sparse array
  is pushed on to the stack, store a magic scalar in the array itself,
  which av_exists and refto recognise as not existing.  This means there
  is only a one-time cost for putting such arrays on the stack.  It also
  means that deferred elements that live long enough don’t start
  pointing to the wrong array entry if the array gets shifted (or
  unshifted/spliced) in the mean time.  Instead, the scalar is already
  in the array, so it cannot lose its place.

  This fix only applies when the array as a whole is pushed on to the
  stack, but it could be extended in future commits to apply to other
  places where we currently use deferred elements.
* Add switch_to_global_locale()  (Karl Williamson, 2018-02-18; 1 file, -0/+1)

  This new API function is for use in applications that call alien
  library routines that are expecting the old pre-POSIX 2008 locale
  functionality, namely a single global locale accessible via
  setlocale().  This function converts the calling thread to use that
  global locale, if not already there.
* Revise sync_locale() for safe multi-threaded operation  (Karl Williamson, 2018-02-18; 1 file, -1/+1)

  This function now returns a boolean, and does not want an aTHX
  parameter.  There should be no impact on code that uses the macro form
  to call it.
* Add thread-safe locale handling  (Karl Williamson, 2018-02-18; 1 file, -0/+8)

  This (large) commit allows locales to be used in threaded perls on
  platforms that support it.  This includes recent Windows and POSIX
  2008 ones.
* Make new_numeric() static  (Karl Williamson, 2018-02-18; 1 file, -1/+1)

  This core-only function is now used only in one file.
* Add Perl_setlocale()  (Karl Williamson, 2018-02-18; 1 file, -1/+1)

  khw could not find any modules on CPAN that correctly use the C
  library function setlocale().  (The very few that do try do not use it
  correctly, looking at the return value incorrectly, so they are
  broken.)  This analysis does not include modules that call non-Perl
  libraries that may call setlocale().

  And, a future commit will render the setlocale() function useless in
  some configurations on some platforms.

  So this commit adds Perl_setlocale(), for XS code to call, and which
  is always effective, but it should not be used to alter the locale
  except on platforms where the predefined variable ${^SAFE_LOCALES}
  evaluates to 1.

  This function is also what POSIX::setlocale() calls to do the real
  work.
* Add uvchr_to_utf8_flags_msgs()  (Karl Williamson, 2018-02-07; 1 file, -1/+3)

  This is prompted by Encode's needs.  When called with the proper
  parameter, it returns any warnings instead of displaying them
  directly.
* utf8.c: Extract code into separate function  (Karl Williamson, 2018-02-07; 1 file, -0/+3)

  This is in preparation for the next commit, which will use this code
  in multiple places.
* Mark new function utf8n_to_uvchr_msgs() as experimental  (Karl Williamson, 2018-02-01; 1 file, -1/+1)

  This function was written specifically for Encode's needs.  My intent
  is to eventually make it publicly usable, but since it's new, we
  should give some time for it to prove itself.
* locale.c: Extract duplicated code into subroutines  (Karl Williamson, 2018-01-31; 1 file, -0/+2)

  These two paradigms are each repeated in 4 places.  Make them into two
  subroutines.
* Give isSCRIPT_RUN() an extra parameter  (Karl Williamson, 2018-01-30; 1 file, -2/+5)

  This allows it to return the script of the run.
* Add ANYOFM regnode  (Karl Williamson, 2018-01-30; 1 file, -0/+4)

  This is a specialized ANYOF node for use when the code points in it
  have characteristics that allow them to be matched with a mask instead
  of a bit map.  When this happens, the speed-up is pretty spectacular:

  Key:
      Ir      Instruction read
      Dr      Data read
      Dw      Data write
      COND    conditional branches
      IND     indirect branches

  The numbers represent raw counts per loop iteration.

  Results of ('b' x 10000) . 'a' =~ /[Aa]/

                blead      mask  Ratio %
             --------   -------  -------
      Ir     153132.0   25636.0    597.3
      Dr      40909.0    2155.0   1898.3
      Dw      20593.0     593.0   3472.7
      COND    20529.0    3028.0    678.0
      IND        22.0      22.0    100.0

  See the comments in regcomp.c or
  http://nntp.perl.org/group/perl.perl5.porters/249001 for a description
  of the cases that this new technique can handle.  But several common
  ones include the C0 controls (on ASCII platforms), [01], [0-7], [Aa]
  and any other ASCII case pair.

  The set of ASCII characters also could be done with this node instead
  of having the special ASCII regnode, reducing code size and
  complexity.  I haven't investigated the speed loss of doing so.  A
  NANYOFM node could be created for matching the complements of what
  this one matches.

  A pattern like /A/i is not affected by this commit, but the regex
  optimizer could be changed to take advantage of it.  What would need
  to be done is for it to look at the first byte of an EXACTFish node
  and, if it's one of the case pairs this handles, generate a synthetic
  start class for it.  This would automatically invoke the sped-up code.
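  The mask idea can be sketched in two lines of C (illustrative only,
  not perl's regcomp code): when every member of the set shares the
  same bit pattern outside some mask, membership is one AND plus one
  compare instead of a bitmap probe.

  ```c
  /* [Aa]: 'A' (0x41) and 'a' (0x61) differ only in the 0x20 bit,
   * so clearing that bit maps both onto 'A'. */
  static int
  matches_Aa(unsigned char c)
  {
      return (c & 0xDF) == 'A';
  }

  /* [0-7]: '0' (0x30) through '7' (0x37) share the upper five bits,
   * so masking off the low three bits maps all eight onto '0'. */
  static int
  matches_0_to_7(unsigned char c)
  {
      return (c & 0xF8) == 0x30;
  }
  ```

  Any set whose members occupy exactly the values spanned by some group
  of free bits admits this encoding, which is why the C0 controls and
  the ASCII case pairs all qualify.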
* regcomp.c: Allow a fcn param to be NULL  (Karl Williamson, 2018-01-30; 1 file, -1/+1)

  In which case handling is skipped.  This is in preparation for a
  future commit which will use this function in a slightly different
  manner.
* regexec.c: Use word-at-a-time to repeat /i single byte pattern  (Karl Williamson, 2018-01-30; 1 file, -0/+2)

  For most of the case folding pairs, like [Aa], it is possible to use a
  mask to match them word-at-a-time in regrepeat(), so that long
  sequences of them are handled with significantly better performance.
* regexec.c: Use word-at-a-time to repeat a single byte pattern  (Karl Williamson, 2018-01-30; 1 file, -0/+1)

  There is special code in the function regrepeat() to handle instances
  where the pattern to repeat is a single byte.  These all can be done
  word-at-a-time to significantly increase the performance of long
  repeats.
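  The word-at-a-time idea can be sketched as follows (an illustrative
  re-implementation, not perl's regrepeat code): replicate the target
  byte into every lane of a 64-bit word, then XOR eight input bytes at
  once.  Matching lanes become zero, so a whole word of the run matches
  iff the XOR result is zero, and the loop only drops to byte-at-a-time
  at the tail or at the first differing word.

  ```c
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Length of the run of 'target' bytes at the start of s[0..len). */
  static size_t
  run_length(const unsigned char *s, size_t len, unsigned char target)
  {
      const uint64_t ones = 0x0101010101010101ULL;
      const uint64_t pat  = ones * target;   /* target in every lane */
      size_t i = 0;
      while (len - i >= 8) {
          uint64_t w;
          memcpy(&w, s + i, 8);              /* safe unaligned load */
          if ((w ^ pat) != 0)
              break;                         /* some byte differs */
          i += 8;
      }
      while (i < len && s[i] == target)      /* finish byte-at-a-time */
          i++;
      return i;
  }
  ```

  For an ASCII case pair such as [Aa] (the /i variant in the previous
  commit), the same loop works by first ANDing the word with ~0x20
  replicated to every lane, then comparing against the uppercase
  pattern, folding both cases in one operation per word.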
* Compile variant_byte_number() for EBCDIC  (Karl Williamson, 2018-01-30; 1 file, -2/+0)

  Future commits will use this without regard to platform.
* Use different scheme to handle MSVC6  (Karl Williamson, 2018-01-30; 1 file, -1/+1)

  Recent commit 0b08cab0fc46a5f381ca18a451f55cf12c81d966 caused a
  function to not be compiled when running on MSVC6, and hence its
  callers needed to use an alternative mechanism there.  This is easy
  enough, it turns out, but it also turns out that there are more
  opportunities to call this function.  Rather than having each caller
  have to know about the MSVC6 problem, this current commit reimplements
  the function on that platform to use a slow, dumb method, so knowing
  about the issue is confined to just this one function.
* Add utf8n_to_uvchr_msgs()  (Karl Williamson, 2018-01-30; 1 file, -1/+7)

  This UTF-8 to code point translator variant is to meet the needs of
  Encode, and provides XS authors with more general capability than the
  other decoders.
* Don't use variant_byte_number on MSVC6  (Karl Williamson, 2018-01-29; 1 file, -1/+1)

  See [perl #132766]
* embed.fnc: Formal param shouldn't be const  (Karl Williamson, 2018-01-24; 1 file, -1/+1)
* tr///: return Size_t count rather than I32  (David Mitchell, 2018-01-19; 1 file, -7/+7)

  Change the signature of all the internal do_trans*() functions to
  return Size_t rather than I32, so that the count returned by tr/// can
  cope with strings longer than 2Gb.
* Revert "Revert "make PerlIO handle FD_CLOEXEC""  (Zefram, 2018-01-18; 1 file, -1/+3)

  This reverts commit 523d71b314dc75bd212794cc8392eab8267ea744,
  reinstating commit 2cdf406af42834c46ef407517daab0734f7066fc.
  Reversion is not the way to address the porting problem that motivated
  that reversion.