path: root/embed.h
Commit message (Author, Age, Files, Lines)
* study_chunk: honour mutate_ok over recursion (Hugo van der Sanden, 2020-06-01, 1 file, -1/+1)
| | | | | | | | | | | | | | As described in #17743, study_chunk can re-enter itself either by simple recursion or by enframing. 089ad25d3f used the new mutate_ok variable to track whether we were within the framing scope of GOSUB, and to disallow mutating changes to ops if so. This commit extends that logic to reentry by recursion, passing in the current state as was_mutate_ok. (CVE-2020-12723) (cherry picked from commit 3445383845ed220eaa12cd406db2067eb7b8a741)
* study_chunk: extract rck_elide_nothing (Hugo van der Sanden, 2020-06-01, 1 file, -0/+1)
| | | | | | (CVE-2020-10878) (cherry picked from commit 4fccd2d99bdeb28c2937c3220ea5334999564aa8)
* Add named sequences to Unicode wildcard name capabilities (Karl Williamson, 2020-03-20, 1 file, -2/+2)
| | | | | | | | | Prior to this commit, specifying a named sequence would result in a mostly unhelpful fatal error message. This makes their use legal. This is also the beginning of allowing Unicode string properties, which are a new thing in the (still draft) Unicode requirements for regular expression parsing, UTS 18. Full compliance will have to come later.
* pp_match(): output regex debugging info (Karl Williamson, 2020-03-18, 1 file, -0/+1)
| This fixes #17612.
|
| This adds an inline function to pp_hot to be called to determine whether
| debugging info should be output, regardless of whether the request comes
| from -Dr or from a 'use re Debug' statement.
* chained comparisons (Zefram, 2020-03-12, 1 file, -0/+3)
|
* Rmv obsolete function (Karl Williamson, 2020-03-11, 1 file, -1/+0)
| | | | | Use of this function was removed as part of adding wildcarding to the Unicode name property
* Allow debugging from regexec.c back to regcomp.c (Karl Williamson, 2020-03-11, 1 file, -1/+8)
| | | | | | | | | The compilation of User-defined properties in a regular expression that haven't been defined at the time that pattern is compiled is deferred until execution time. Until this commit, any request for debugging info on those was ignored. This fixes that by
* Add thread safety to some environment accesses (Karl Williamson, 2020-03-11, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | The previous commit added a mutex specifically for protecting against simultaneous accesses of the environment. This commit changes the normal getenv, putenv, and clearenv functions to use it, to avoid races. This makes the code simpler in places where we've gotten burned and added stuff to avoid races. Other places where we haven't known we were getting burned could have existed until now. Now that comes automatically, and we can remove the special cases we earlier stumbled over. getenv() returns a pointer to static memory, which can be overwritten at any moment from another thread, or even another getenv from the same thread. This commit changes the accesses to be under control of a mutex, and in the case of getenv, a mortalized copy is created so that there is no possible race.
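The locking-plus-copy idea described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: `env_mutex`, `safe_getenv_copy` and `safe_setenv` are hypothetical names, the sketch uses a POSIX mutex and a `malloc`ed copy, whereas the commit uses Perl's own environment mutex and a mortalized copy.

```c
#ifndef _POSIX_C_SOURCE
#  define _POSIX_C_SOURCE 200809L   /* for setenv() under strict -std modes */
#endif
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names: these are illustrative and are not the identifiers
 * used in the Perl core. */
static pthread_mutex_t env_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Returns a heap-allocated copy of the variable's value, or NULL if the
 * variable is unset or allocation fails. The caller frees the result. */
char *safe_getenv_copy(const char *name)
{
    char *copy = NULL;

    pthread_mutex_lock(&env_mutex);
    const char *value = getenv(name);   /* points into shared storage */
    if (value != NULL) {
        size_t len = strlen(value) + 1;
        copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, value, len);   /* copy while the lock is held */
    }
    pthread_mutex_unlock(&env_mutex);
    return copy;
}

/* Writers take the same lock, so a reader never copies a value that is
 * being replaced at that moment. */
int safe_setenv(const char *name, const char *value)
{
    pthread_mutex_lock(&env_mutex);
    int rc = setenv(name, value, 1);
    pthread_mutex_unlock(&env_mutex);
    return rc;
}
```

Because the caller receives its own copy, a later `safe_setenv()` from any thread cannot change the string the caller is already holding.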
* Implement \p{Name=/.../} wildcards (Karl Williamson, 2020-03-11, 1 file, -0/+1)
| | | | | This commit adds wildcard subpatterns for the Name and Name Aliases properties.
* optimize sort by inlining comparison functions (Tomasz Konojacki, 2020-03-09, 1 file, -0/+9)
| | | | | | | | This makes special-cased forms such as sort { $b <=> $a } even faster. Also, since this commit removes PL_sort_RealCmp, it fixes the issue with nested sort calls mentioned in gh #16129
* Allow more debugging in re_comp.c (Karl Williamson, 2020-03-02, 1 file, -2/+6)
| | | | | | | | This adds two main functions that were previously only defined in regcomp.c to also be defined in re_comp.c. This allows re.pm to use debugging with them. To avoid duplicating large data structures, several lightweight wrapper functions are added to regcomp.c that re_comp.c calls to access those structures.
* Move two regex functions so that can use re debug (Karl Williamson, 2020-03-02, 1 file, -2/+2)
| | | | These have to have a version in re_comp.c for re.pm to work on them.
* embed.fnc: Reorder the entries dealing with regexes (Karl Williamson, 2020-03-02, 1 file, -33/+33)
| | | | | | This moves a bunch of entries around so that they make more sense, and consolidates some blocks that had the same #ifdefs. There should be no change in what gets compiled.
* regcomp.c: Add wrappers for cmplng/xctng wildcard subpatterns (Karl Williamson, 2020-02-19, 1 file, -0/+2)
| | | | | | | | This is in preparation for being called from more than one place. It has the salubrious effect that the wrapping we do around the user's supplied pattern is no longer visible in the Debug output of that pattern.
* regcomp.c: Create wrapper fcn for re_op_compile (Karl Williamson, 2020-02-19, 1 file, -0/+1)
| | | | | | This does the bulk of re_compile(), but is a private entry point, meaning it takes an extra parameter, and a future commit will call it from another place.
* Move some obsolete UTF-8 handling fcns to mathoms (Karl Williamson, 2020-02-19, 1 file, -0/+6)
| | | | | Two of the functions are internal to the core; the third has long been deprecated.
* Improve handling of nested qr/(?[...])/ (Karl Williamson, 2020-02-19, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | A set operations expression can contain a previously-compiled one interpolated in. Prior to this commit, some heuristics were employed to verify it actually was such a thing, and not a sort of look-alike that wasn't necessarily valid. The heuristics actually forbade legal ones. I don't know of any illegal ones that were let through, but it is certainly possible. Also, the error/warning messages referred to the heuristics, and were unhelpful at best. The technique used instead in this commit is to return a regop only used by this feature for any nested compilations. This guarantees that the caller can determine if the result is valid, and what that result is without having to do any heuristics or inspecting any flags. The error/warning messages are changed to reflect this, and I believe are now helpful. This fixes the bugs in #16779 https://github.com/Perl/perl5/issues/16779#issuecomment-563987618
* toke.c: Split code to load _charnames.pm into own fnc (Karl Williamson, 2020-02-12, 1 file, -0/+1)
| | | | This is in preparation for it being called from more than one place.
* utf8.c: Use common fcn for error message (Karl Williamson, 2020-01-23, 1 file, -1/+3)
| | | | | There is now a function that generates this error message. This is so that it is always the same from wherever generated.
* Move cntrl_to_mnemonic() to util.c from regcomp.c (Karl Williamson, 2020-01-23, 1 file, -1/+1)
| | | | | This is in preparation for it being used elsewhere, to reduce duplication of code.
* Remove dquote_inline.h (Karl Williamson, 2020-01-23, 1 file, -1/+3)
| | | | | The remaining function in this file is moved to inline.h, just to not have an extra file lying around with hardly anything in it.
* (toke|regcomp).c: Use common fcn to handle \0 problems (Karl Williamson, 2020-01-23, 1 file, -1/+0)
| | | | | | | | This changes warning messages for too short \0 octal constants to use the function introduced in the previous commit. This function assures a consistent and clear warning message, which is slightly different than the one this commit replaces. I know of no CPAN code which depends on this warning's wording.
* Restructure grok_bslash_[ox] (Karl Williamson, 2020-01-23, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit causes these functions to allow a caller to request any messages generated to be returned to the caller, instead of always being handled within these functions. The messages are somewhat changed from previously to be clearer. I did not find any code in CPAN that relied on the previous message text. Like the previous commit for grok_bslash_c, here are two reasons to do this, repeated here. 1) In pattern compilation this brings these messages into conformity with the other ones that get generated in pattern compilation, where there is a particular syntax, including marking the exact position in the parse where the problem occurred. 2) These could generate truncated messages due to the (mostly) single-pass nature of pattern compilation that is now in effect. It keeps track of where during a parse a message has been output, and won't output it again if a second parsing pass turns out to be necessary. Prior to this commit, it had to assume that a message from one of these functions did get output, and this caused some out-of-bounds reads when a subparse (using a constructed pattern) was executed. The possibility of those went away in commit 5d894ca5213, which guarantees it won't try to read outside bounds, but that may still mean it is outputting text from the wrong parse, giving meaningless results. This commit should stop that possibility.
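The "return the message to the caller instead of emitting it" pattern described above might look roughly like this. It is a minimal sketch, not the core's grok_bslash_o(): the function name `parse_braced_octal`, its signature and the message texts are all assumptions made for illustration, and overflow handling is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: parses an escape body such as "o{17}". On failure
 * the problem is handed back through *message when the caller supplied a
 * place for it; otherwise the function reports it itself. */
bool parse_braced_octal(const char *s, unsigned long *result,
                        const char **message)
{
    const char *problem = NULL;
    unsigned long value = 0;

    if (s[0] != 'o' || s[1] != '{') {
        problem = "Missing braces on \\o{}";
    } else {
        const char *p = s + 2;
        if (*p == '}')
            problem = "Empty \\o{}";
        while (problem == NULL && *p != '}') {
            if (*p == '\0')
                problem = "Missing right brace on \\o{}";
            else if (*p < '0' || *p > '7')
                problem = "Non-octal character in \\o{}";
            else
                value = (value << 3) | (unsigned long)(*p - '0');
            p++;
        }
    }

    if (problem != NULL) {
        if (message != NULL)
            *message = problem;            /* caller decides how to report */
        else
            fprintf(stderr, "%s\n", problem);
        return false;
    }
    if (message != NULL)
        *message = NULL;
    *result = value;
    return true;
}
```

A pattern compiler calling such a helper can attach its own position marker to the returned text, and can record that the message belongs to the current parse rather than assuming it was already printed.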
* Restructure grok_bslash_c (Karl Williamson, 2020-01-23, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit causes this function to allow a caller to request any messages generated to be returned to the caller, instead of always being handled within this function. Like the previous commit for grok_bslash_c, here are two reasons to do this, repeated here. 1) In pattern compilation this brings these messages into conformity with the other ones that get generated in pattern compilation, where there is a particular syntax, including marking the exact position in the parse where the problem occurred. 2) The messages could be truncated due to the (mostly) single-pass nature of pattern compilation that is now in effect. It keeps track of where during a parse a message has been output, and won't output it again if a second parsing pass turns out to be necessary. Prior to this commit, it had to assume that a message from one of these functions did get output, and this caused some out-of-bounds reads when a subparse (using a constructed pattern) was executed. The possibility of those went away in commit 5d894ca5213, which guarantees it won't try to read outside bounds, but that may still mean it is outputting text from the wrong parse, giving meaningless results. This commit should stop that possibility.
* Hoist code point portability warnings (Karl Williamson, 2020-01-23, 1 file, -2/+2)
|
* Improve performance of grok_bin_oct_hex() (Karl Williamson, 2020-01-13, 1 file, -1/+4)
| This commit uses a variety of techniques for speeding this up. It is now
| faster than blead, and has less maintenance cost than before.
|
| Most of the checks that the current character isn't NUL are unnecessary.
| The logic works on that character, even if, for some reason, you can't
| trust the input length. A special test is added to not output the illegal
| character message if that character is a NUL. This is simply for
| backcompat.
|
| And a switch statement is used to unroll the loop for the leading digits
| in the number. This should handle most common cases. Beyond these, one
| has to start worrying about overflow, so this version has removed that
| worry from the common cases.
|
| Extra conditionals are avoided for large numbers by extracting the
| portability warning message code into a separate static function called
| from two different places. Simplifying this logic led me to see that if
| it overflowed, it must be non-portable, so another conditional could be
| removed.
|
| Other conditionals were removed at the expense of adding parameters to
| the function. This function isn't public, but is called from the
| grok_hex, et al. macros. grok_hex knows, for example, that it is looking
| for an 'x' prefix and not a 'b'. Previously the code had a conditional to
| determine that.
|
| Similarly in pp.c, we look for the prefix. Having found it we can start
| the parse after the prefix, and tell this function not to look for it.
| Previously, this work was duplicated.
|
| The previous changes had left this function slower than blead. That is in
| part due to the fact that the loop doesn't go through that many
| iterations per function call, and the gcc compiler managed to optimize
| away the conditionals in XDIGIT_VALUE in the call of it from the loop.
| (The other call in this function did have the conditionals.)
|
| Thanks to Sergey Aleynikov for his help on this
* Collapse grok_bin, _oct, _hex into one function (Karl Williamson, 2020-01-13, 1 file, -3/+1)
| | | | | | | | | | These functions are identical in logic in the main loop, the difference being which digits they accept. The rest of the code had slight variations. This commit unifies the functions. I presume the reason they were kept separate was because of speed. Future commits will make this unified function faster than blead, and the reduced maintenance cost makes this worthwhile.
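To illustrate why the three parsers can share one main loop, here is a minimal sketch in which the only difference between binary, octal and hexadecimal is the number of bits per digit. It is not the core's grok_bin_oct_hex(): the names `parse_pow2_base`, `parse_bin`, `parse_oct` and `parse_hex` are hypothetical, and the sketch ignores overflow, underscores between digits and the optional prefixes that the real function deals with.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one digit character, or -1 if it is not a hex digit. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* One loop for all three bases. 'shift' is the number of bits per digit:
 * 1 for binary, 3 for octal, 4 for hexadecimal. Parsing stops at the first
 * character that is not a digit of the base; *consumed reports how many
 * characters were accepted. */
uint64_t parse_pow2_base(const char *s, size_t len, unsigned shift,
                         size_t *consumed)
{
    const int base = 1 << shift;
    uint64_t value = 0;
    size_t i = 0;

    for (; i < len; i++) {
        int d = digit_value(s[i]);
        if (d < 0 || d >= base)
            break;
        value = (value << shift) | (uint64_t)d;
    }
    *consumed = i;
    return value;
}

/* Thin per-base wrappers; the caller's knowledge of the base is passed in
 * as a parameter rather than rediscovered inside the loop. */
#define parse_bin(s, len, used) parse_pow2_base((s), (len), 1, (used))
#define parse_oct(s, len, used) parse_pow2_base((s), (len), 3, (used))
#define parse_hex(s, len, used) parse_pow2_base((s), (len), 4, (used))
```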
* Rmv leading underscore from macro name (Karl Williamson, 2019-12-11, 1 file, -2/+2)
| | | | | | | These are illegal in C, but we have plenty of them around; I happened to be looking at this function, and decided to fix it. Note that only the macro name is illegal; the function was fine, but to change the macro name means changing the function one.
* Add the `isa` operator (Paul "LeoNerd" Evans, 2019-12-09, 1 file, -0/+2)
| Adds a new infix operator named `isa`, with the semantics that
|
|   $x isa SomeClass
|
| is true if and only if `$x` is a blessed object reference that is either
| `SomeClass` directly, or includes the class somewhere in its @ISA
| hierarchy. It is false without warning or error for non-references or
| non-blessed references.
|
| This operator respects `->isa` method overloading, and is intended to
| replace boilerplate code such as
|
|   use Scalar::Util 'blessed';
|
|   blessed($x) and $x->isa("SomeClass")
* PATCH: gh #17275 Silence new warning (Karl Williamson, 2019-11-21, 1 file, -4/+6)
| | | | | | | | | This was caused by a static inline function in a header that was #included in a file that didn't use it. Normally, these functions are #ifdef'd so as to be visible only to files in which they are used. Some compilers warn that the function is defined but not used otherwise. The solution is to remove this function's visibility from the file that didn't use it.
* regcomp.c: Add invlist_lowest() and use it (Karl Williamson, 2019-11-20, 1 file, -0/+1)
| | | | | | This makes it less complicated to find the lowest code point in an inversion list. This makes the place where it's used clearer as to what is going on. And it may eventually be used in more than one place.
* find_first_differing_byte_pos (Karl Williamson, 2019-11-20, 1 file, -0/+1)
|
* add explicit 1-arg and 3-arg sig handler functions (David Mitchell, 2019-11-18, 1 file, -0/+4)
| Currently, whether the OS-level signal handler function is declared as
| 1-arg or 3-arg depends on the configuration. Add explicit versions of
| these functions, principally so that POSIX.xs can call whichever version
| of the handler it wants regardless of configuration: see next commit.
* add a new function, Perl_perly_sighandler() (David Mitchell, 2019-11-18, 1 file, -0/+1)
| This function implements the body of what used to be Perl_sighandler(),
| the latter becoming a thin wrapper round Perl_perly_sighandler().
|
| The main reason for this change is that it allows us to add an extra arg,
| 'safe', to the function without breaking backcompat. This arg indicates
| whether the function is being called directly from the OS signal handler
| (safe==0), or deferred via Perl_despatch_signals() (safe==1).
|
| This allows an infelicity in the code to be fixed - it was formerly
| trying to determine the route it had been called by (and hence whether a
| 'safe' route) by seeing if either of the sig/uap parameters was non-null.
| It turns out that this was highly dodgy, and only worked by luck. The
| safe caller did indeed pass NULL args, but due to a bug (shortly to be
| fixed), sometimes the kernel thinks it's calling a 1-arg sig handler when
| it's actually calling a 3-arg one. This means that the sig/uap args are
| random garbage, and happen to be non-zero only by happy coincidence on
| the OS/platforms so far.
|
| Also, it turns out that the call via Perl_csighandler() was getting it
| wrong: its explicit (NULL,NULL) args made it look like a safe signal
| call. This will be corrected in the next commit, but for this commit the
| old wrong behaviour is preserved.
|
| See RT #82040 for details of when/why the original dodgy 'safe' check
| was added.
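The "thin wrapper plus explicit flag" shape described above can be sketched as follows. All names and signatures here are hypothetical stand-ins, not the core's, and the sketch shows the intended end state (the wrapper always reports an unsafe call) rather than the transitional behaviour this commit preserves.

```c
#include <stdbool.h>
#include <stddef.h>

/* Recorded only so the effect of each entry point can be observed. */
int  sketch_last_sig  = 0;
bool sketch_last_safe = false;

/* New-style body: 'safe' states explicitly how the handler was reached,
 * instead of guessing from whether info/context happen to be NULL. */
void sketch_perly_sighandler(int sig, void *info, void *context, bool safe)
{
    (void)info;
    (void)context;
    sketch_last_sig  = sig;
    sketch_last_safe = safe;
}

/* Old entry point keeps its original signature, so existing callers still
 * compile; it simply forwards to the new body. */
void sketch_sighandler(int sig, void *info, void *context)
{
    sketch_perly_sighandler(sig, info, context, false);
}

/* Deferred dispatch calls the body directly and marks the call as safe. */
void sketch_despatch_deferred(int sig)
{
    sketch_perly_sighandler(sig, NULL, NULL, true);
}
```

Keeping the old function as a wrapper is what lets the extra parameter be added without changing the signature that outside code may already call.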
* add PERL_USE_3ARG_SIGHANDLER macro (David Mitchell, 2019-11-18, 1 file, -10/+10)
| There are a bunch of places in core that do
|
|   #if defined(HAS_SIGACTION) && defined(SA_SIGINFO)
|
| to decide whether the C signal handler function should be declared with,
| and called with, 1 arg or 3 args.
|
| This commit just adds
|
|   #if defined(HAS_SIGACTION) && defined(SA_SIGINFO)
|   #  define PERL_USE_3ARG_SIGHANDLER
|   #endif
|
| Then uses the new macro in all other places rather than checking
| HAS_SIGACTION and SA_SIGINFO. Thus there is no functional change; it just
| makes the code more readable.
|
| However, it turns out that all is not well with core's use of 1-arg
| versus 3-arg, and the next few commits will fix this.
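As a standalone illustration of deciding once, in one macro, whether handlers take 1 or 3 arguments, the sketch below tests only `SA_SIGINFO`, because Perl's `HAS_SIGACTION` configuration symbol does not exist outside the Perl build. `SKETCH_USE_3ARG_SIGHANDLER` and the function names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#  define _POSIX_C_SOURCE 200809L   /* for sigaction() under strict -std modes */
#endif
#include <signal.h>
#include <stddef.h>

/* Decide once whether handlers are declared with 1 or 3 arguments. */
#if defined(SA_SIGINFO)
#  define SKETCH_USE_3ARG_SIGHANDLER
#endif

static volatile sig_atomic_t last_signal = 0;

/* Every later declaration tests the single macro instead of repeating the
 * underlying feature checks. */
#ifdef SKETCH_USE_3ARG_SIGHANDLER
static void sketch_handler(int sig, siginfo_t *info, void *context)
{
    (void)info;
    (void)context;
    last_signal = sig;
}
#else
static void sketch_handler(int sig)
{
    last_signal = sig;
}
#endif

/* Installs the handler in the form that matches its declaration. */
int install_sketch_handler(int sig)
{
#ifdef SKETCH_USE_3ARG_SIGHANDLER
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;          /* tells the kernel to pass 3 args */
    sa.sa_sigaction = sketch_handler;
    return sigaction(sig, &sa, NULL);
#else
    return signal(sig, sketch_handler) == SIG_ERR ? -1 : 0;
#endif
}

int last_signal_seen(void)
{
    return (int)last_signal;
}
```

The point of the single macro is that the declaration and the installation code cannot disagree about the handler's arity, which is the kind of mismatch the following commits set out to fix.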
* regcomp.c: Add parameter to static function (Karl Williamson, 2019-11-17, 1 file, -1/+1)
| | | | | This further decouples this function from knowing details of the calling structure, by passing this detail in.
* clean up quadmath_format_*() functions (Tony Cook, 2019-11-16, 1 file, -4/+4)
| This includes:
|
| - remove them from the API
|
| - simplify quadmath_format_single()'s interface, and rename it to match
|   the new interface
|
| fixes #17288
* Remove swashes from core (Karl Williamson, 2019-11-06, 1 file, -4/+0)
| | | | Also references to the term.
* Reimplement tr/// without swashes (Karl Williamson, 2019-11-06, 1 file, -12/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This large commit removes the last use of swashes from core. It replaces swashes by inversion maps. This data structure is already in use for some Unicode properties, such as case changing. The inversion map data structure leads to straight forward implementation code, so I collapsed the two doop.c routines do_trans_complex_utf8() and do_trans_simple_utf8() into one. A few conditionals could be avoided in the loop if this function were split so that one version didn't have to test for, e.g., squashing, but I suspect these are in the noise in the loop, which has to deal with UTF-8 conversions. This should be faster than the previous implementation anyway. I measured the differences some releases back, and inversion maps were faster than the equivalent swash for up to 512 or 1024 different ranges. These numbers are unlikely to be exceeded in tr/// except possibly in machine-generated ones. Inversion maps are capable of handling both UTF-8 and non-UTF-8 cases, but I left in the existing non-UTF-8 implementation, which uses tables, because I suspect it is faster. This means that there is extra code, purely for runtime performance. An inversion map is always created from the input, and then if the table implementation is to be used, the table is easily derived from the map. Prior to this commit, the table implementation was used in certain edge cases involving code points above 255. Those cases are now handled by the inversion map implementation, because it would have taken extra code to detect them, and I didn't think it was worth it. That could be changed if I am wrong. Creating an inversion map for all inputs essentially normalizes them, and then the same logic is usable for all. This fixes some false negatives in the previous implementation. It also allows for detecting if the actual transliteration can be done in place. 
Previously, the code mostly punted on that detection for the UTF-8 case. This also allows for accurate counting of the lengths of the two sides, fixing some longstanding TODO warning tests. A new flag is created, OPpTRANS_CAN_FORCE_UTF8, when the tr/// has a below 256 character resolving to one that requires UTF-8. If this isn't set, the code knows that a non-UTF-8 input won't become UTF-8 in the process, and so can take short cuts. The bit representing this flag is the same as OPpTRANS_FROM_UTF, which is no longer used. That name is left in so that the dozen-ish modules in cpan that refer to it can still compile. AFAICT none of them actually use the flag, as well they shouldn't since it is private to the core. Inversion maps are ideally suited for tr/// implementations. An issue with them in general is that for some pathological data, they can become fragmented requiring more space than you would expect, to represent the underlying data. However, the typical tr/// would not have this issue, requiring only very short inversion maps to represent; in some cases shorter than the table implementation. Inversion maps are also easier to deparse than swashes. A deparse TODO was also fixed by this commit, and the code to deparse UTF-8 inputs is simplified. One could implement specialized data structures for specific types of inputs. For example, a common tr/// form is a single range, like tr/A-Z/a-z/. That could be implemented without a table and be quite fast. An intermediate step would be to use the inversion map implementation always when the transliteration is a single range, and then special case length=1 maps at execution time. Thanks to Nicholas Rochemagne for his help on B
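To make the inversion map idea concrete, here is a minimal lookup sketch. The layout is an assumption chosen for illustration and is not the core's data structure: `starts[]` holds the first code point of each range in ascending order (with `starts[0]` equal to 0), and `maps[]` holds what that first code point maps to, or `NO_MAP` for ranges the transliteration leaves alone. The names `InvMap`, `invmap_search` and `invmap_translate` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_MAP UINT32_MAX

typedef struct {
    const uint32_t *starts;   /* ascending range starts; starts[0] == 0 */
    const uint32_t *maps;     /* mapping of each range's first code point */
    size_t          len;      /* number of ranges, at least 1 */
} InvMap;

/* Binary search for the last range whose start is <= cp. */
static size_t invmap_search(const InvMap *m, uint32_t cp)
{
    size_t lo = 0, hi = m->len;          /* answer lies in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->starts[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Translate one code point: unmapped ranges pass through unchanged, and
 * mapped ranges keep each code point's offset from the range start. */
uint32_t invmap_translate(const InvMap *m, uint32_t cp)
{
    size_t i = invmap_search(m, cp);
    if (m->maps[i] == NO_MAP)
        return cp;
    return m->maps[i] + (cp - m->starts[i]);
}

/* tr/A-Z/a-z/ needs only three ranges in this layout. */
static const uint32_t az_starts[] = { 0,      'A', 'Z' + 1 };
static const uint32_t az_maps[]   = { NO_MAP, 'a', NO_MAP  };
const InvMap upper_to_lower = { az_starts, az_maps, 3 };
```

A typical tr/// yields a map this short, which is consistent with the observation above that only pathological inputs fragment into many ranges.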
* op.c: Add debugging dump function (Karl Williamson, 2019-11-06, 1 file, -1/+2)
| | | | This function dumps out an inversion map
* doop.c: Add a parameter to a few fcns (Karl Williamson, 2019-11-06, 1 file, -3/+3)
| | | | | instead of deriving it each time from inside the function. This is in preparation for future commits.
* Move some static fcns from regcomp.c to invlist_inline.h (Karl Williamson, 2019-11-06, 1 file, -4/+4)
| | | | | They are still only accessible from regcomp.c, but this is in preparation for them to be usable from other core files as well.
* regcomp.c: Change name of static function. (Karl Williamson, 2019-11-06, 1 file, -1/+1)
| | | | This removes an unnecessary leading underscore
* Factor out common code from sv_derived_from_* subs family (Sergey Aleynikov, 2019-11-04, 1 file, -2/+5)
| | | | | into one that takes both SV*/char*+len arguments, like hv_common, to be able to use speedups from SV* stash lookup API.
* PATCH: Character class code broke MSWin32 compilation (Karl Williamson, 2019-11-03, 1 file, -99/+0)
| | | | | | | | | I'm not sure why this didn't show up elsewhere, but we have embed.fnc entries for non-existent functions that should have been removed in dd1a3ba7882ca70c1e85b0fd6c03d07856672075. In addition, I see several more functions that should have been removed, and this commit removes them.
* Rmv more deprecated characlassify/case change macros (Karl Williamson, 2019-10-31, 1 file, -13/+4)
| | | | These were missed by 059703b088f44d5665f67fba0b9d80cad89085fd.
* Remove deprecated character classification/case changing macros (Karl Williamson, 2019-09-29, 1 file, -76/+9)
| | | | | | | | | | | | | | It has been deprecated since 5.26 to use various macros that deal with UTF-8 inputs but don't have a parameter indicating the maximum length beyond which we should not look. This commit changes all such macros, as threatened in existing documentation and warning messages, to have an extra parameter giving the length. This was originally scheduled to happen in 5.30, but was delayed because it broke some CPAN modules, and there wasn't really a good way around it. But now that Devel::PPPort 3.54 is out, ppport.h has new facilities for getting modules making these changes to work with older Perl releases.
* Un-revert "[MERGE] add+use si_cxsubix field" (David Mitchell, 2019-09-23, 1 file, -0/+1)
| original merge commit: v5.31.3-198-gd2cd363728
| reverted by:           v5.31.4-0-g20ef288c53
|
| The commit following this commit fixes the breakage, which means the
| revert can be undone.
* Revert "[MERGE] add+use PL_curstackinfo->si_cxsubix field" (tag: v5.31.4; Max Maischein, 2019-09-20, 1 file, -1/+0)
| This reverts commit d2cd363728088adada85312725ac9d96c29659be, reversing
| changes made to 068b48acd4bdf9e7c69b87f4ba838bdff035053c.
|
| This change breaks installing Test::Deep:
|
|   ...
|   not ok 37 - Test 'isa eq' completed
|   ok 38 - Test 'isa eq' no premature diagnostication
|   ...
* add Perl_gimme_V() static inline fn for GIMME_V (David Mitchell, 2019-09-19, 1 file, -0/+1)
| | | | | | This function makes use of PL_curstackinfo->si_cxsubix to avoid the overhead of a call to block_gimme() when the context of the op is unknown.