path: root/embed.h
Commit message | Author | Age | Files | Lines
* (perl #129000) create a safer utf8_hop() | Tony Cook | 2016-11-09 | 1 | -0/+3

Unlike utf8_hop(), utf8_hop_safe() won't navigate before the beginning or after the end of the supplied buffer. The original version of this put all of the logic into utf8_hop_safe(), but in many cases a caller specifically needs to go forward or backward, and supplying the other limit made the function less usable, so I split the function into forward and backward cases. This split may also make inlining these functions more efficient or more likely.
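A minimal sketch of the bounded-hop pattern these functions enable, assuming the signatures this commit introduces (buf, len, s, and off are illustrative):

    /* s points somewhere inside buf; buf + len is one past its end. */
    U8 *p;

    /* Move forward up to 3 characters, stopping at the end of the buffer. */
    p = utf8_hop_forward(s, 3, buf + len);

    /* Move backward up to 2 characters, stopping at the start. */
    p = utf8_hop_back(p, -2, buf);

    /* When the direction isn't known until runtime, supply both limits. */
    p = utf8_hop_safe(s, off, buf, buf + len);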
* Fix wrong UTF-8 overflow error on 32-bit platforms | Karl Williamson | 2016-11-02 | 1 | -0/+1

Commit 2b5e7bc2e60b4c4b5d87aa66e066363d9dce7930 changed the algorithm for detecting overflow during decoding UTF-8 into code points. However, on 32-bit platforms, this change caused it to claim that some things overflow when they really don't. All such cases are overlong malformations, which are normally, but not necessarily, forbidden. This commit fixes that.
* make regen and args assert fix | Yves Orton | 2016-10-19 | 1 | -4/+4
* Add a way to have functions with a trailing depth argument under debugging | Yves Orton | 2016-10-19 | 1 | -3/+4

In the regex engine it can be useful in debugging mode to maintain a depth counter, but in normal mode this argument would be unused. This allows us to define functions in embed.fnc with a "W" flag; such functions use the _pDEPTH and _aDEPTH defines, which effectively define and pass through a U32 depth parameter via the macro wrappers. These defines are similar to the existing pTHX and aTHX parameters.
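A rough sketch of how such defines typically work; the exact macro bodies and the function name here are assumptions, not copied from the commit:

    #ifdef DEBUGGING
    #  define _pDEPTH  , U32 depth   /* appended to prototypes */
    #  define _aDEPTH  , depth       /* appended at call sites */
    #else
    #  define _pDEPTH
    #  define _aDEPTH
    #endif

    /* A hypothetical embed.fnc entry flagged "W" would yield: */
    void S_trace_node(pTHX_ regnode *node _pDEPTH);
    #define trace_node(node)  S_trace_node(aTHX_ node _aDEPTH)

Outside of DEBUGGING builds the depth parameter disappears entirely, so normal builds pay no cost for it.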
* Add a regex_sets debugging function | Karl Williamson | 2016-10-19 | 1 | -0/+5

This is enabled by a C flag, as commented. It is designed to be found only by someone reading the code and wanting something temporary to help in debugging.
* regexec.c: in debug fixup indents and TRIE/BUFFER debug output | Yves Orton | 2016-10-19 | 1 | -2/+2
* sv.c: add sv_setpv_bufsize() and SvPVCLEAR() | Yves Orton | 2016-10-19 | 1 | -0/+1

The first can be used to wrap several SvPV steps into a single sub; the second is a wrapper macro which is the equivalent of $s = "" but is optimized in various ways.
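A brief usage sketch of the macro (the SV here is illustrative):

    SV *sv = newSVpvs("some old contents");

    /* Equivalent to $s = "" in Perl, but can reuse the SV's existing
     * buffer rather than allocating a fresh one. */
    SvPVCLEAR(sv);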
* Add utf8n_to_uvchr_error | Karl Williamson | 2016-10-13 | 1 | -1/+1

This new function behaves like utf8n_to_uvchr(), but takes an extra parameter that points to a U32 which will be set to 0 if no errors are found; otherwise each error found will set a bit in it. This can be used by the caller to figure out precisely what the error(s) is/are. Previously, one would have to capture and parse the warning/error messages raised. This can be used, for example, to customize the messages to the expected end-user's knowledge level.
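A sketch of the intended calling pattern; the UTF8_GOT_* bit names are assumed from the accompanying work, and s, curlen, and report() are illustrative:

    U32 errors;
    STRLEN retlen;
    UV uv = utf8n_to_uvchr_error(s, curlen, &retlen, 0, &errors);

    if (errors) {
        /* Each class of malformation sets its own bit. */
        if (errors & UTF8_GOT_OVERFLOW)
            report("sequence overflows a UV");
        if (errors & UTF8_GOT_SURROGATE)
            report("UTF-16 surrogate code point");
    }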
* utf8.c: Extract some code into 2 functions | Karl Williamson | 2016-10-13 | 1 | -0/+2

This is in preparation for each to be used in a new place in a future commit.
* Add details to UTF-8 malformation error messages | Karl Williamson | 2016-10-13 | 1 | -1/+2

I've long been unsatisfied with the information contained in the error/warning messages raised when some input is malformed UTF-8, but have been reluctant to change the text in case someone is relying on it.

One reason that someone might be parsing the messages is that there has been no convenient way to otherwise pin down what the exact malformation might be. A few commits from now will add a facility to get the type of malformation unambiguously. That will be a better mechanism to use for those rare modules that need to know the exact malformation. So, I will fix and issue pull requests for any module broken by this commit.

The messages are changed to now dump (in \xXY format) the bytes that make up the malformed character, and extra details are added in most cases. Messages about overlongs now display the code point they evaluate to and the shortest UTF-8 sequence for generating that code point. Messages about overflowing now just state that the sequence overflows, since the entire byte sequence is now dumped. The previous message displayed just the byte being processed when overflow was detected, but that information is not at all meaningful.
* utf8.c: Consolidate duplicate error msg text | Karl Williamson | 2016-10-13 | 1 | -0/+1

This text is generated in 2 places; consolidate into one place.
* Add is_utf8_fixed_width_buf_flags() and use it | Karl Williamson | 2016-09-25 | 1 | -0/+1

This encodes a simple pattern that may not be immediately obvious to someone needing it. If you have a fixed-size buffer that is full of purportedly UTF-8 bytes, is it valid or not? It's easy to do, as shown in this commit. The file test operators -T and -B can be simplified by using this function.
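A hedged sketch of the pattern being encoded (buf and fill_buffer() are illustrative):

    /* A buffer filled by a fixed-width read: the final character may be
     * cut off mid-sequence, and that should still count as valid. */
    U8 buf[4096];
    STRLEN len = fill_buffer(buf, sizeof(buf));

    if (is_utf8_fixed_width_buf_flags(buf, len, 0)) {
        /* Everything in the buffer is well-formed UTF-8, allowing
         * for a possibly-partial final character. */
    }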
* Add API Unicode handling functions | Karl Williamson | 2016-09-25 | 1 | -0/+6

These functions are all extensions of the is_utf8_string_foo() functions, which restrict the UTF-8 recognized as valid in various ways. There are named ones for the two definitions that Unicode makes, and foo_flags ones for more custom restrictions. The named ones are implemented as tries, while the flags ones provide complete generality.
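A sketch of the two styles of call; the flag name is an assumption based on the existing UTF8_DISALLOW_* flags, and s, len, and process() are illustrative:

    /* Named form: strict Unicode definition (no surrogates, no
     * noncharacters, nothing above U+10FFFF). */
    if (is_strict_utf8_string(s, len))
        process(s, len);

    /* Flags form: restrict only what the caller chooses. */
    if (is_utf8_string_flags(s, len, UTF8_DISALLOW_SURROGATE))
        process(s, len);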
* Add is_utf8_valid_partial_char_flags() | Karl Williamson | 2016-09-17 | 1 | -1/+1

This is a generalization of is_utf8_valid_partial_char to allow the caller to automatically exclude things such as surrogates.
* Enhance and rename is_utf8_char_slow() | Karl Williamson | 2016-09-17 | 1 | -1/+1

This changes the name of this helper function and adds a parameter and functionality to allow it to exclude problematic classes of code points, the same ones excludable by utf8n_to_uvchr(), like surrogates or non-character code points.
* utf8.c: Extract duplicate code to common fcn | Karl Williamson | 2016-09-17 | 1 | -0/+1

Actually the code isn't quite duplicated, but it should be, because one instance is wrong. This failure would only show up on EBCDIC platforms. Tests are coming in a future commit.
* Fix checks for tainted dir in $ENV{PATH} | Father Chrysostomos | 2016-09-03 | 1 | -0/+1

    $ cat > foo
    #!/usr/bin/perl
    print "What?!\n"
    ^D
    $ chmod +x foo
    $ ./perl -Ilib -Te '$ENV{PATH}="."; exec "foo"'
    Insecure directory in $ENV{PATH} while running with -T switch at -e line 1.

That is what I expect to see. But:

    $ ./perl -Ilib -Te '$ENV{PATH}="/\\:."; exec "foo"'
    What?!

Perl is allowing the \ to escape the :, but the \ is not treated as an escape by the system, allowing a relative path in PATH to be considered safe.
* Add is_utf8_valid_partial_char() | Karl Williamson | 2016-08-31 | 1 | -0/+1

This new function can test some purported UTF-8 to see if it is well-formed as far as it goes. That is, there aren't enough bytes for the character they start, but what is there is legal so far. This can be useful with a fixed-width buffer, where the final character is split in the middle, and we want to test, without waiting for the next read, that the buffer so far is valid.
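A hedged sketch of the intended use (s and e are illustrative pointers framing the incomplete trailing bytes of a buffer):

    /* s .. e holds trailing bytes that do not form a complete
     * character.  Check whether they could begin a valid one. */
    if (is_utf8_valid_partial_char(s, e)) {
        /* Legal so far; the next read may complete the character. */
    }
    else {
        /* Malformed no matter what bytes follow. */
    }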
* Move isUTF8_CHAR helper function, and reimplement it | Karl Williamson | 2016-08-31 | 1 | -1/+1

The macro isUTF8_CHAR calls a helper function for code points higher than it can handle. That function had been an inlined wrapper around utf8n_to_uvchr(). The function has been rewritten not to call utf8n_to_uvchr(), so it is now too big to be effectively inlined. Instead, it implements a faster method of checking the validity of the UTF-8 without having to decode it. It just checks for valid syntax, now knows where the few discontinuities are in UTF-8 where overlongs can occur, and uses a string compare to verify that overflow won't occur. As a result this is now a pure function.

This also causes a previously generated deprecation warning to no longer be raised, because in printing UTF-8, it no longer has to be converted to internal form. I could add a check for that, but I think it's best not to. If you manipulated what is getting printed in any way, the deprecation message will already have been raised.

This commit also fleshes out the documentation of isUTF8_CHAR.
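A small sketch of validating a buffer with the macro, one character at a time (s and e are illustrative):

    /* Walk the buffer without decoding it; isUTF8_CHAR() yields the
     * byte length of the next character, or 0 on malformation. */
    const U8 *p = s;
    while (p < e) {
        STRLEN char_len = isUTF8_CHAR(p, e);
        if (!char_len)
            break;             /* malformed at p */
        p += char_len;
    }
    /* p == e here exactly when the whole buffer was well-formed. */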
* Inline is_utf8_invariant_string() | Karl Williamson | 2016-08-31 | 1 | -1/+1
* Add new synonym 'is_utf8_invariant_string' | Karl Williamson | 2016-08-31 | 1 | -1/+1

This is clearer as to its meaning than the existing 'is_ascii_string' and 'is_invariant_string', which are retained for back compat. The thread context variable is removed as it is not used.
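A brief hedged sketch (pv and len are illustrative):

    /* True when every byte means the same thing whether or not the
     * string is treated as UTF-8 (i.e. ASCII, on ASCII platforms). */
    if (is_utf8_invariant_string((const U8 *)pv, len)) {
        /* No encoding conversion is ever needed for this string. */
    }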
* Document valid_utf8_to_uvchr() and inline it | Karl Williamson | 2016-08-31 | 1 | -1/+1

This function has been in several releases without problem, and is short enough that some compilers can inline it. This commit also notes that the result should not be ignored, and removes the unused pTHX. The function has explicitly been marked as being changeable, and has not been part of the API until now.
* signatures: eliminate XSIGVAR, add KEY_sigvar | David Mitchell | 2016-08-18 | 1 | -1/+1

When I moved subroutine signature processing into perly.y with v5.25.3-101-gd3d9da4, I added a new lexer PL_expect state, XSIGVAR. This indicated, when about to parse a variable, that it was a signature element rather than a my variable; in particular, it makes ($,...) be toked as the lone sigil '$' rather than the punctuation variable '$,'. However this is a bit heavy-handed; so instead this commit adds a new allowed pseudo-keyword value to PL_in_my: as well as KEY_my, KEY_our and KEY_state, it can now be KEY_sigvar. This is a less intrusive change to the lexer.
* Rationalise gv.c:gv_magicalize | Father Chrysostomos | 2016-08-04 | 1 | -1/+1

This code is confusing, and confusion has resulted in bugs. So rework the code a bit to make it more comprehensible.

gv_magicalize no longer has such an arcane return value. What it does now is simply add appropriate magic to (or do appropriate vivification in) the GV passed as its argument. It returns true if magic or vivification was applicable. The caller (gv_fetchpvn_flags) uses that return value to determine whether to install the GV in the symbol table, if the caller has requested that a symbol only be added if it is a magical one (GV_ADDMG).

This reworking does mean that the GV is now checked for content even when it would make no difference, but I think the resulting clarity (ahem, *relative* clarity) of the code is worth it.
* gv.c:require_tie_mod: Accept pvn params | Father Chrysostomos | 2016-08-04 | 1 | -1/+1

All the callers create the SV on the fly. We might as well put the SV creation into the function itself. (A forthcoming commit will refactor things to avoid the SV when possible.)
* Rework mod loading for %- and %!; fix mem leak | Father Chrysostomos | 2016-08-04 | 1 | -1/+1

There are many built-in variables that perl creates on demand for efficiency's sake. gv_fetchpvn_flags (which is responsible for symbol lookup) will fill in those variables automatically when adding a symbol.

The special GV_ADDMG flag passed to this function by a few code paths (such as defined *{"..."}) tells gv_fetchpvn_flags to add the symbol, but only if it is one of the ‘magical’ built-in variables that we pretend already exist. To accomplish this, when the GV_ADDMG flag is passed, gv_fetchpvn_flags, if the symbol does not already exist, creates a new GV that is not attached to the stash. It then runs it through its magicalization code and checks afterward to see whether the GV changed. If it did, then it gets added to the stash. Otherwise, it is discarded.

Three of the variables, %-, %!, and $[, are problematic, in that they are implemented by external modules. gv_fetchpvn_flags loads those modules, which tie the variable in question, and then control is returned to gv_fetchpvn_flags. If it has a GV that has not been installed in the symbol table yet, then the module will vivify that GV on its own by a recursive call to gv_fetchpvn_flags (with the GV_ADD flag, which does none of this temporary-dangling-GV stuff), and gv_fetchpvn_flags will have a separate one which, when installed, would clobber the one with the tied variable. We solved that by having the GV installed right before calling the module, for those three variables (in perl 5.16).

The implementation changed in commit v5.19.3-437-g930867a, which was supposed to clean up the code and make it easier to follow. Unfortunately there was a bug in the implementation. It tries to install the GV for those cases *before* the magicalization code, but the logic is wrong. It checks to see whether we are adding only magical symbols (addmg) and whether the GV has anything in it, but before anything has been added to the GV. So the symbol never gets installed. Instead, it just leaks, and the one that the implementing module vivifies gets used.

This leak can be observed with XS::APItest::sv_count:

    $ ./perl -Ilib -MXS::APItest -e 'for (1..10){ defined *{"!"}; delete $::{"!"}; warn sv_count }'
    3833 at -e line 1.
    4496 at -e line 1.
    4500 at -e line 1.
    4504 at -e line 1.
    4508 at -e line 1.
    4512 at -e line 1.
    4516 at -e line 1.
    4520 at -e line 1.
    4524 at -e line 1.
    4528 at -e line 1.

Perl 5.18 does not exhibit the leak.

So in this commit I am finally implementing something that was discussed about the time that v5.19.3-437-g930867a was introduced. To avoid the whole problem of recursive calls to gv_fetchpvn_flags vying over whose GV counts, I have stopped the implementing modules from tying the variables themselves. Instead, whichever gv_fetchpvn_flags call is trying to create the glob is now responsible for seeing that the variable is tied after the module is loaded. Each module now provides a _tie_it function that gv_fetchpvn_flags can call.

One remaining infelicity is that Errno mentions $! in its source, so *! will be vivified when it is loading, only to be clobbered by the GV subsequently installed by gv_fetchpvn_flags. But at least it will not leak.

One test that failed as a result of this (in t/op/magic.t) was trying to undo the loading of Errno.pm in order to test it afresh with *{"!"}. But it did not remove *! before the test. The new logic in the code happens to work in such a way that the tiedness of the variable determines whether the module needs to be loaded (which is necessary, now that the module does not tie the variable). Since the test is by no means normal code, it seems reasonable to change it.
* add OP_ARGELEM, OP_ARGDEFELEM, OP_ARGCHECK ops | David Mitchell | 2016-08-03 | 1 | -1/+2

Currently subroutine signature parsing emits many small discrete ops to implement arg handling. This commit replaces them with a couple of ops per signature element, plus an initial signature check op.

These new ops are added to the OP tree during parsing, so will be visible to hooks called up to and including peephole optimisation. It is intended soon that the peephole optimiser will take these per-element ops and replace them with a single OP_SIGNATURE op which handles the whole signature in a single go. So normally these ops won't actually get executed much. But adding these intermediate-level ops gives three advantages:

1) it allows the parser to efficiently generate subtrees containing individual signature elements, which can't be done if only OP_SIGNATURE or discrete ops are available;

2) prior to optimisation, it provides a simple and straightforward representation of the signature;

3) hooks can mess with the signature OP subtree in ways that make it no longer possible to optimise into an OP_SIGNATURE, but which can still be executed, deparsed etc (if less efficiently).

This code:

    use feature "signatures";
    sub f($a, $, $b = 1, @c) {$a}

under 'perl -MO=Concise,f' now gives:

    d  <1> leavesub[1 ref] K/REFC,1 ->(end)
    -     <@> lineseq KP ->d
    1        <;> nextstate(main 84 foo:6) v:%,469762048 ->2
    2        <+> argcheck(3,1,@) v ->3
    3        <;> nextstate(main 81 foo:6) v:%,469762048 ->4
    4        <+> argelem(0)[$a:81,84] v/SV ->5
    5        <;> nextstate(main 82 foo:6) v:%,469762048 ->6
    8        <+> argelem(2)[$b:82,84] vKS/SV ->9
    6           <|> argdefelem(other->7)[2] sK ->8
    7              <$> const(IV 1) s ->8
    9        <;> nextstate(main 83 foo:6) v:%,469762048 ->a
    a        <+> argelem(3)[@c:83,84] v/AV ->b
    -        <;> ex-nextstate(main 84 foo:6) v:%,469762048 ->b
    b        <;> nextstate(main 84 foo:6) v:%,469762048 ->c
    c     <0> padsv[$a:81,84] s ->d

The argcheck(3,1,@) op knows the number of positional params (3), the number of optional params (1), and whether it has an array / hash slurpy element at the end. This op is responsible for checking that @_ contains the right number of args.

A simple argelem(0)[$a] op does the equivalent of 'my $a = $_[0]'. Similarly, argelem(3)[@c] is equivalent to 'my @c = @_[3..$#_]'. If it has a child, it gets its arg from the stack rather than using $_[N]. Currently the only used child is the logop argdefelem. argdefelem(other->7)[2] is equivalent to '@_ > 2 ? $_[2] : other'.

[ These ops currently assume that the lexical var being introduced is undef/empty and non-magical etc. This is an incorrect assumption and is fixed in a few commits' time ]
* make op.c:S_alloc_LOGOP() non-static | David Mitchell | 2016-08-03 | 1 | -0/+1

This is principally so that it can be accessed from perly.y too.
* sub signatures: use parser rather than lexer | David Mitchell | 2016-08-03 | 1 | -1/+0

Currently the signature of a sub (i.e. the '($a, $b = 1)' bit) is parsed in toke.c using a roll-your-own mini-parser. This commit makes the signature be part of the general grammar in perly.y instead.

In theory it should still generate the same optree as before, except that an OP_STUB is no longer appended to each signature optree: it's unnecessary, and I assume that was a hangover from early development of the original signature code.

Error messages have changed somewhat: the generic 'Parse error' has changed to the generic 'syntax error', with the addition of ', near "xyz"' now appended to each message. Also, some specific error messages have been added; for example (@a=1) now says that slurpy params can't have a default value, rather than just giving 'Parse error'.

It introduces a new lexer expect state, XSIGVAR, since otherwise when the lexer saw something like '($, ...)' it would see the identifier '$,' rather than the tokens '$' and ','.

Since it no longer uses parse_termexpr(), it is no longer subject to the bug (#123010) associated with that; so

    sub f($x = print, $y) {}

is no longer mis-interpreted as

    sub f($x = print($_, $y)) {}
* locale.c: Add some DEBUG statements | Karl Williamson | 2016-08-02 | 1 | -0/+3

This also involves moving some complicated debugging statements to a separate function so that it can be called from more than one place.
* regcomp.c: Silence compiler warning | Karl Williamson | 2016-07-20 | 1 | -2/+2

These functions are no longer needed in re_comp.c.
* regcomp.c: Silence compiler warning | Karl Williamson | 2016-07-17 | 1 | -2/+2

It turns out that the changes in 0854ea0b9abfd9ff71c9dca1b5a5765dad2a20bd caused two functions to no longer be used in re_comp.c.
* Fix -Dr output regression | Karl Williamson | 2016-07-16 | 1 | -1/+1

Several commits in the 5.23 series improved the display of the compiled ANYOF regnodes, but introduced two bugs.

One of them is in \p{Any} and similar things that match the entire range 0-255. That range is omitted, so it looks like \p{Any} only matches code points above 255. Note that this is only what gets displayed under -Dr. What actually gets compiled has been and still is fine.

The other is that when displaying a pattern that still has unresolved user-defined properties that are complemented, it doesn't show properly that the whole thing is complemented. That is, the output looks like it doesn't obey De Morgan's laws.

The fixes to these are quite intertwined, and so I didn't try to separate them.
* Remove mg.c:_get_encoding | Father Chrysostomos | 2016-07-13 | 1 | -1/+0

Nothing uses it now, and it does nothing.
* Remove IN_ENCODING macro, and all code dependent on it | Father Chrysostomos | 2016-07-13 | 1 | -1/+0
* [perl #128478] Restore former "$foo::$bar" parsing | Father Chrysostomos | 2016-06-27 | 1 | -1/+1

The function scan_word, in toke.c, is used to parse barewords. The scan_ident function is used to scan an identifier after a sigil. Prior to v5.17.9-108-g07f7264, both functions had their own parsing loops, and scan_ident actually had two, one for $foo and another for ${foo}.

The stated purpose of 07f7264 was to fix discrepancies in the parsing of $foo vs ${foo}, by making the two forms use the same parsing code. In accomplishing this, the commit in question merged not only the two loops in scan_ident, but all three loops, including the one in scan_word, by introducing a new function, parse_ident, that the others call.

One result was that some logic appropriate only to scan_word started to be applied also to scan_ident; namely, that ::$ would be explicitly checked for and disallowed (the parsing would stop before the ::), for the sake of the “Bad name after Foo::” error. The consequence was that "$foo::$bar" started to be parsed as $foo."::".$bar, instead of $foo:: . $bar, as previously. Now, "$foo::@bar" was unaffected, so by fixing one form of inconsistency we ended up with another, which caused bugs, including B::Deparse bugs (because B::Deparse was not consistent with the core).

This commit restores the previous behaviour by giving parse_ident an extra parameter, making the ::$ check optional.
* Change scalar(%hash) to be the same as 0+keys(%hash) | Yves Orton | 2016-06-22 | 1 | -0/+1

This subject has a long history; see [perl #114576] for more discussion. https://rt.perl.org/Public/Bug/Display.html?id=114576

There are a variety of reasons we want to change the return signature of scalar(%hash). One is that it leaks implementation details about our associative array structure. Another is that it requires us to keep track of the used buckets in the hash, which we use for no other purpose than scalar(%hash). Another is that it is just odd. Almost nothing needs to know these values. Perhaps debugging, but we have several much better functions for introspecting the internals of a hash.

By changing the return signature we can remove all the logic related to maintaining and updating xhv_fill_lazy. This should make hot code paths a little faster, and maybe save some memory for traversed hashes.

In order to provide some form of backwards compatibility we add three new functions to the Hash::Util namespace: bucket_ratio(), num_buckets() and used_buckets(). These functions are actually implemented in universal.c, and thus always available even if Hash::Util is not loaded. This simplifies testing. At the same time Hash::Util contains backwards compatible code so that the new functions are available from it should they be needed in older perls.

There are many tests in t/op/hash.t that are more or less obsolete after this patch, as they test that xhv_fill_lazy is correctly set in various situations. However since we have a backwards compat layer we can just switch them to use bucket_ratio(%hash) instead of scalar(%hash) and keep the tests, just in case they are actually testing something not tested elsewhere.
* Prepare for Unicode 9.0 | Karl Williamson | 2016-06-21 | 1 | -1/+2

The major code changes needed to support Unicode 9.0 are to changes in the boundary (break) rules, for things like \b{lb}, \b{wb}.

regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see if a given point in the target string is a break or not, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break, and whose column corresponds to the one after. Mostly this is completely determining, but for some cases, extra context is required, and the array entry indicates this, and there has to be specially crafted code in regexec.c to handle each such possibility. When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and regexec.c has to be changed to handle any changes to the custom code.

Unfortunately this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along, which need new rules. Sometimes it is because they realized the previous version didn't work as well as it could. An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between each pair, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emojis. We can expect further changes to these in the next Unicode releases.

\b{gcb} for the first time now depends on context (in rarely encountered cases, like RI's), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.

Some years ago I revamped mktables in part to try to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included. Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is there may not have been a paradigm, just something they did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong.

Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, like CJK UNIFIED IDEOGRAPH-3400 has always done. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with these automatically without manual intervention.
* mv function from locale.c to mathoms.c | Karl Williamson | 2016-05-24 | 1 | -1/+3

The previous commit made the function being moved just a wrapper that is not called in core. Just in case someone is calling it, it is retained, but moved to mathoms.c.
* Do better locale collation in UTF-8 locales | Karl Williamson | 2016-05-24 | 1 | -0/+5

On some platforms, the libc strxfrm() works reasonably well on UTF-8 locales, giving a default collation ordering. It will assume that every string passed to it is in UTF-8. This commit changes Perl to make sure that strxfrm's expectations are met. Likewise, under a non-UTF-8 locale, strxfrm is expecting a non-UTF-8 string, and this commit makes sure of that as well.

So, simply meeting strxfrm's expectations allows Perl to start supporting default collation in UTF-8 locales, and fixes it to work on single-byte locales with UTF-8 input. (Unicode::Collate provides tailorable functionality and is portable to platforms where strxfrm isn't as intelligent, but is a much more heavy-weight solution that may not be needed for particular applications.)

There is a problem in non-UTF-8 locales if the passed string contains code points representable only in UTF-8. This commit causes them to be changed, before being passed to strxfrm, into the highest collating character in the locale that doesn't require UTF-8. They then will sort the same as that character, which means after all other characters in the locale but that one. In strings that don't have that character, this will generally provide exactly correct operation.

There still is a problem if that character, in the given locale, combines with adjacent characters to form a specially weighted sequence. Then, the change of these above-255 code points into that character can skew the results. See the commit message for 6696cfa7cc3a0e1e0eab29a11ac131e6f5a3469e for more on this. But it is really an illegal situation to have above-255 code points in a single-byte locale, so this behavior is a reasonable degradation when given illegal input. If two transformed strings compare exactly equal, Perl already uses the un-transformed versions to break ties, and there, these faked-up strings will collate so the above-255 code points sort after everything else, and in code point order amongst themselves.
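For reference, a minimal standalone sketch of the libc primitive being relied on here (plain C, not the perl-internal code path; the buffer sizes and locale name are illustrative):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    /* Collation-aware comparison: transform both strings, then compare
     * the transforms bytewise. */
    static int collate_cmp(const char *a, const char *b) {
        char xa[256], xb[256];          /* assume transforms fit */
        strxfrm(xa, a, sizeof(xa));
        strxfrm(xb, b, sizeof(xb));
        return strcmp(xa, xb);
    }

    int main(void) {
        setlocale(LC_COLLATE, "en_US.UTF-8");   /* availability varies */
        printf("%d\n", collate_cmp("cote", "côte"));
        return 0;
    }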
* Use memmem() if available on the platform for Perl_ninstr() | Karl Williamson | 2016-05-12 | 1 | -1/+3
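memmem() is the byte-string analogue of strstr(); a minimal sketch of it (it is a GNU/BSD extension, hence the feature macro; the strings are illustrative):

    #define _GNU_SOURCE             /* expose memmem() on glibc */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char hay[] = "a needle in a big haystack";
        const char *hit = memmem(hay, sizeof(hay) - 1, "needle", 6);
        if (hit)
            printf("found at offset %td\n", hit - hay);
        return 0;
    }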
* Make two functions for 5.005 backcompat MATHOMS | Karl Williamson | 2016-05-12 | 1 | -0/+4

The functions sv_setpviv() and sv_setpviv_mg() exist only for backcompat with 5.005 (verified with Jarkko and Nicholas). Make them compile only when mathoms is present. They can't be moved to mathoms.c because they both call a static function visible only in sv.c.
* embed.fnc: Alter 'b' flag meaning | Karl Williamson | 2016-05-12 | 1 | -3/+5

This commit changes this flag to mean that the backward compatibility functions are compiled unless the -DNO_MATHOMS cflag is specified to Configure. Previously the meaning was sort of like that, but not precisely.

Doing this means that the prototypes that needed to be manually added to mathoms.c are no longer needed. No special parameter assertions have to be made. makedef.pl no longer needs to parse mathoms.c and have special cases for it. And several special-case entries in embed.fnc can be un-special-cased.
* Reinstate "Make instr() a macro"Karl Williamson2016-05-091-1/+0
| | | | | | This reverts commit 2e08dfb2b133af0fbcb4346f8d096ca68454ca54, thus reinstating the commit it reverted. This was delayed until 5.25. The next commit will solve some problems with c++ that this commit causes
* Revert "Make instr() a macro"Karl Williamson2016-04-081-0/+1
| | | | | | This reverts commit fea1d2dd5d210564d442a09fe034b62f262f35f9 due to it causing problems so close to the release of 5.24. See https://rt.perl.org/Ticket/Display.html?id=127852
* Get -Accflags=-DPERL_MEM_LOG compiling again | Matthew Horsfall | 2016-04-05 | 1 | -0/+5

It had rotted a bit. Well, more than a bit, probably. Move the declarations of the functions Perl_mem_log_alloc etc. from handy.h into embed.fnc, where they belong, and where Malloc_t will have already been defined.
* Make instr() a macro | Karl Williamson | 2016-03-17 | 1 | -1/+0

... thus avoiding the overhead of a function call.
* fixup definitions and usage of new re debugging subs | Yves Orton | 2016-03-13 | 1 | -0/+9

This should fix the smoke failures on threaded builds. It also renames re_indentfo, which was a terrible name in the first place; now that I have had to strip the Perl_ prefixes from these subs with a perl -i -pe, I took the opportunity to rename it to re_exec_indent, which self-documents much better.
* Rework diagnostics in the regex engine | Yves Orton | 2016-03-13 | 1 | -1/+1

This introduces three new subs:

Perl_re_printf(), which is a wrapper for PerlIO_printf( Perl_debug_log, ... ), which cuts down on clutter in the code. Arguably this could be moved to util.c and renamed something like PerlIO_debugf(), and then we could declutter all the statements that write to the Perl_debug_log filehandle. But that is a bit too ambitious for me right now, so I leave this as a regex-engine-only sub for now.

Perl_re_indentf(), which is a wrapper for Perl_re_printf(), adds an indent argument and automatically indents the line appropriately; it is used in regcomp.c for trace diagnostics during compilation.

Perl_re_indentfo(), which is similar to Perl_re_indentf() but is used in regexec.c, adds a specific prefix to each indented line to account for the fact that during execution we normally have string position information on the left.

The end result of this patch is that a lot of clutter in the debugging statements in the regex engine is reduced, exposing what is actually going on. It should also now be easier to add new diagnostics which "do the right thing". Over time the debugging trace output in regexec has become very cluttered and confusing. This patch cleans much of it up: if something happens at a given recursion depth it is output at the right depth, etc., and formats have been changed to not have leading spaces so you can actually see the indentation properly.
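An illustrative before/after based on the description above; the format string and node variable are made up, and depth is assumed to be the recursion-depth argument these subs take:

    /* before: each trace line spells out the filehandle and indent */
    PerlIO_printf(Perl_debug_log, "%*s" "trying node %d\n",
                  (int)(2 * depth), "", node);

    /* after: the wrapper supplies the filehandle and indentation */
    Perl_re_indentf(aTHX_ "trying node %d\n", depth, node);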
* make building without memcpy work (RT #127619) | Lukas Mai | 2016-03-07 | 1 | -3/+3