delta/perl.git - github.com: perl/perl5.git

	Commit message	Author	Age	Files	Lines
*	Silence -Wunused-parameter my_perl under threads.	Jarkko Hietaniemi	2014-06-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	For S_ functions, remove the context. For Perl_ functions, add PERL_UNUSED_CONTEXT. Tricky because it sometimes depends on DEBUGGING, and sometimes on whether we have PERL_IMPLICIT_SYS. (Why all the mathoms Perl_is_uni_... and Perl_is_utf8_... functions are not being whined about is a mystery.) vutil.c (included via util.c) has one of these, but it's cpan/, and a known problem: https://rt.cpan.org/Ticket/Display.html?id=96100
*	PATCH: [perl #122084] BBC KARMAN/Search-Tools	Karl Williamson	2014-06-16	1	-1/+1
\| \| \| \| \|	The problem was that a function was defined only in PERL_CORE, and embed.fnc just needed to change to grant access outside that.
*	Remove MAD.	Jarkko Hietaniemi	2014-06-13	1	-68/+0
\| \| \| \| \| \|	MAD = Misc Attribute Decoration; unmaintained attempt at preserving the Perl parse tree more faithfully so that automatic conversion to Perl 6 would have been easier.
*	Deprecate unescaped literal "{" in regex patterns	Karl Williamson	2014-06-12	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit also causes escaped (by a backslash) "(", "[", and "{" to be considered literally. In the previous 2 Perl versions, the escaping was ignored, and a (default-on) deprecation warning was raised. Now that we have warned for 2 release cycles, we can change the meaning of escaping to actually do something. Warning when a literal left brace is not escaped by a backslash will allow us to eventually use this character in more contexts as being meta, allowing us to extend the language. For example, the lower limit of a quantifier could be omitted, and better error checking instituted, or things like \w could be followed by a {...} indicating some special word character, like \w{Greek} to restrict to just Greek word characters. We tried to do this in v5.16, and many CPAN modules changed to backslash their left braces at that time. However we had to back out that change before 5.16 shipped because it turned out that escaping a left brace in some contexts didn't work, namely when the brace would normally be a metacharacter (for example surrounding a quantifier), and the pattern delimiters were { }. Instead we raised the useless backslash warning mentioned above, which has now been there for the requisite 2 cycles. This patch partially reverts 2 patches. The first, e62d0b1335a7959680be5f7e56910067d6f33c1f, partially reverted the deprecation of unescaped literal left brace. The other, 4d68ffa0f7f345bc1ae6751744518ba4bc3859bd, instituted the deprecation of the useless left-characters. Note that, as in the original attempt to deprecate, we don't raise a warning if the left brace is the first character in the pattern. This is because in that position it can't be a metacharacter, so we don't require any disambiguation, and we found that if we did raise an error, there were quite a few places where this occurred.
*	Silence several -Wunused-parameter warnings about my_perl	Brian Fraser	2014-06-13	1	-26/+26
\| \| \| \| \| \| \| \|	This meant sprinkling some PERL_UNUSED_CONTEXT invocations, as well as stopping some functions from getting my_perl in the first place; all of the functions in the latter category are internal (S_ prefix and s or i in embed.fnc), so this should be both safe and economical.
*	Mark several functions with __attribute__noreturn__	Brian Fraser	2014-06-13	1	-5/+22
\| \| \| \| \|	Namely, die_nocontext, die, die_sv, and screaminstr. They all croak and never return, so let's mark them as non-returning.
*	Fix some compilation warnings	Karl Williamson	2014-06-12	1	-0/+2
\| \| \| \| \| \|	After commits d6ded95025185cb1ec8ca3ba5879cab881d8b180 and 130c5df3625bd130cd1e2771308fcd4eb66cebb2, there are some compilation warnings if not all locale categories are used.
*	remove 1 read of interp var from PUSHMARK	Daniel Dragan	2014-06-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	PL_markstack_ptr was read once to do the ++ and comparison. Then, after the markstack_grow call (or without it, depending on the branch), the code reads PL_markstack_ptr a 2nd time. It has to be reread in case markstack_grow reallocs the mark stack (which it always does). markstack_grow has a void retval. That is a waste of a register. Let us put it to use to return the new PL_markstack_ptr. In markstack_grow the contents that will be assigned to PL_markstack_ptr are already in a register. So let the I32* flow out from markstack_grow to its caller. In VC2003 32 bit asm, mark_stack_entry is register eax. The retval of markstack_grow is in eax. So the assignment "=" in "mark_stack_entry = markstack_grow();" has no overhead. Since the other, non-extend branch is function call free, "(mark_stack_entry = ++PL_markstack_ptr)" assigns to eax. Ultimately with this patch a 3 byte mov instruction is saved for each instance of PUSHMARK, and 1 interp var read is removed. I observed 42 callers of markstack_grow with my disassembler, so theoretically 3*42 bytes of machine code were removed for me. Perl_pp_pushmark dropped from 0x2b to 0x28 bytes of x86 VC 2003 machine code. [perl #122034]
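The idea in this commit message can be illustrated with a small, self-contained C sketch. This is not perl's actual code: the `mark_stack` struct and the functions `mark_stack_init`, `mark_stack_grow` and `push_mark` are hypothetical stand-ins for PL_markstack_ptr, markstack_grow() and PUSHMARK. It only shows why returning the new top-of-stack pointer from the grow function lets the caller avoid re-reading the stack pointer variable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the interpreter's mark stack variables. */
typedef struct {
    int *base;  /* start of the allocation (slot 0 is unused) */
    int *ptr;   /* current top entry, like PL_markstack_ptr */
    int *max;   /* one past the last usable slot */
} mark_stack;

static void mark_stack_init(mark_stack *ms, size_t cap)
{
    assert(ms != NULL && cap >= 1);
    ms->base = malloc(cap * sizeof *ms->base);
    if (ms->base == NULL)
        abort();
    ms->ptr = ms->base;
    ms->max = ms->base + cap;
}

/* Grow the stack and return the new top-of-stack pointer, so the
 * caller does not have to read ms->ptr again after the realloc. */
static int *mark_stack_grow(mark_stack *ms)
{
    size_t used = (size_t)(ms->ptr - ms->base);
    size_t new_cap = (size_t)(ms->max - ms->base) * 2;
    int *nb = realloc(ms->base, new_cap * sizeof *nb);
    if (nb == NULL)
        abort();
    ms->base = nb;
    ms->max = nb + new_cap;
    ms->ptr = nb + used;
    return ms->ptr;
}

/* Analogue of PUSHMARK: one read of the stack pointer on either branch. */
static void push_mark(mark_stack *ms, int value)
{
    int *entry;
    if ((entry = ++ms->ptr) == ms->max)
        entry = mark_stack_grow(ms);
    *entry = value;
}
```

In this sketch the local `entry` plays the role of mark_stack_entry: it is assigned either from the increment or from the grow function's return value, and never from a second read of `ms->ptr`.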
*	Add C backtrace API.	Jarkko Hietaniemi	2014-06-07	1	-0/+6
\| \| \| \| \| \| \| \|	Useful for at least debugging. Supported in Linux and OS X (possibly to some extent in *BSD). See perlhacktips for details.
*	Use C locale for "$!" outside 'use locale' scope	Karl Williamson	2014-06-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The stringification of $! has long been an outlier in Perl locale handling. The theory has been that these operating system messages are likely to be of use to the final user, and should be in their language. Things like "No space left on device" and "Can't fork" are not something the program is likely to handle, but could be meaningfully helpful to the end-user. There are problems with this though. One is that many perl messages are in English, with the $! appended to them, so that the resultant message is of mixed language, and may need to be translated anyway. Things like "No space left on device" probably won't need the remaining portion of the message to give someone a clear indication as to what's wrong. But there are many other messages where both the OS error and the Perl error would be needed together to understand the problem. An on-line translation tool can be used to do this. Another problem is that it can lead to garbage coming out on the user's terminal when the program is not expecting UTF-8, but the underlying locale is UTF-8. This is what happens in Bug #112208, and another that was merged with it. It's a lot harder to translate mojibake via an online tool than English. This commit solves that by using the C locale for messages, except within the scope of 'use locale'. It is extremely likely that the messages in the C locale will be English, but if not they will be ASCII, and there will be no garbage printed. A program that says "use locale" is indicating that it has the intelligence necessary to deal with locales.
*	Add parameters to "use locale"	Karl Williamson	2014-06-05	1	-0/+1
\| \| \| \| \| \| \|	This commit allows one to specify to enable locale-awareness for only a specified subset of the locale categories. Thus you could make a section of code LC_MESSAGES aware, with no locale-awareness for the other categories.
*	Set utf8 flag properly in localeconv	Karl Williamson	2014-06-05	1	-2/+3
\| \| \| \| \| \| \| \| \|	Rare, but not unheard of, is for the strings returned by localeconv to be in UTF-8. This commit looks for and sets the UTF-8 flag if they are so encoded. A private function had to be changed from static for this. It is renamed to begin with an underscore to emphasize its private nature.
*	regcomp.c - cleanup the ahocorasick start class logic so it is more self-documenting	Yves Orton	2014-06-02	1	-3/+2
\| \| \| \| \| \| \| \| \|	The logic of setting up an AHO-CORASICK regex start class was not fully encapsulated in the make_trie_failtable() function, which itself was poorly named. Merged the code into make_trie_failtable() and renamed it to construct_ahocorasick_from_trie().
*	Move some deprecated utf8-handling functions to mathoms	Karl Williamson	2014-05-31	1	-2/+6
\| \| \| \| \|	This entailed creating new internal functions for some of them to call so that the functionality can be retained during the deprecation period.
*	Make is_utf8_char_buf() a macro	Karl Williamson	2014-05-31	1	-1/+1
\| \| \| \| \| \|	This function is now more efficiently implemented as a synonym for isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code that calls it that way.
*	Create isUTF8_CHAR() macro and use it	Karl Williamson	2014-05-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This macro will inline the code to determine if a character is well-formed UTF-8 for code points below a certain value, falling back to a slower function for larger ones. On ASCII platforms, it will inline for well-beyond all legal Unicode code points. On EBCDIC, it currently does it for code points up to 0x3FFF. This could be increased, but our porting tests do the regen every time to make sure everything is ok, and making it larger slows that down. This is worked around on ASCII by normally commenting out the code that generates this info, but including in utf8.h a version that did get generated. This is static information and won't change. (This could be done for EBCDIC too, but I chose not to at this time as each code page has a different macro generated, and it gets ugly getting all of them in utf8.h) Using this macro allowed for simplification of several functions in utf8.c
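The "inline the common case, fall back to a function for the rest" pattern described above can be sketched in C as follows. This is only an illustration under stated assumptions: `utf8_char_len` and `utf8_char_len_slow` are hypothetical names, the inline cut-off (one- and two-byte sequences) is arbitrary, and the checks follow strict Unicode UTF-8 rules, which are narrower than what Perl itself accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Slow path (hypothetical): validate 3- and 4-byte sequences. */
static size_t utf8_char_len_slow(const unsigned char *s, const unsigned char *e)
{
    size_t avail = (size_t)(e - s);
    if (avail >= 3 && (s[0] & 0xF0) == 0xE0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        if (s[0] == 0xE0 && s[1] < 0xA0)
            return 0;                      /* overlong encoding */
        if (s[0] == 0xED && s[1] >= 0xA0)
            return 0;                      /* surrogate range */
        return 3;
    }
    if (avail >= 4 && (s[0] & 0xF8) == 0xF0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
            || (s[3] & 0xC0) != 0x80)
            return 0;
        if (s[0] == 0xF0 && s[1] < 0x90)
            return 0;                      /* overlong encoding */
        if (s[0] > 0xF4 || (s[0] == 0xF4 && s[1] > 0x8F))
            return 0;                      /* above U+10FFFF */
        return 4;
    }
    return 0;
}

/* Fast path: handle short sequences inline, defer the rest.
 * Returns the byte length of the well-formed character at s, or 0. */
static inline size_t utf8_char_len(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e != NULL);
    if (s >= e)
        return 0;
    if (s[0] < 0x80)
        return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF)
        return (e - s >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    return utf8_char_len_slow(s, e);
}
```

The trade-off mirrored here is the one the commit message mentions: a larger inline range avoids more function calls but costs more generated code.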
*	utf8.c: Move a static function to inline.h	Karl Williamson	2014-05-31	1	-1/+1
\| \| \| \| \|	This is in preparation for it being called from outside utf8.c. It is renamed to have a leading underscore to emphasize its private nature
*	regcomp.c: Move code into a function	Karl Williamson	2014-05-30	1	-0/+2
\| \| \| \|	This is in preparation for it to be called from another place
*	regcomp.c, regexec.c: Move common code to a function	Karl Williamson	2014-05-30	1	-0/+1
\| \| \| \|	There are other cases where this functionality will be needed as well.
*	/x in patterns now includes all \p{PatWS}	Karl Williamson	2014-05-30	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This brings Perl regular expressions more into conformance with Unicode. /x now accepts 5 additional characters as white space. Use of these characters as literals under /x has been deprecated since 5.18, so now we are free to change what they mean. This commit eliminates the static function that processes the old whitespace definition (and a generated macro that was used only for this), using the already existing one for the new definition. It refactors slightly the static function that skips comments to mesh better with the needs of its callers, and calls it in one place where before the code was essentially duplicated. p5p discussion starting in http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that the (?[ ]) comments should be terminated the same way as regular /x comments, and this was also done in this commit. No prior notice is necessary as this is an experimental feature.
*	utf8.c: Move static function to embed.fnc	Karl Williamson	2014-05-29	1	-0/+3
\| \| \| \|	This automatically generates assertions for pointer arguments, etc.
*	Only prototype should_warn_nl under PERL_CORE.	Craig A. Berry	2014-05-29	1	-0/+2
\| \| \| \| \| \| \|	Otherwise the prototype is seen in places where the function itself is not defined, which is illegal for inline functions. Follow-up to 7cb3f9598b37fc.
*	[perl #121112] only warn if newline is the last non-NUL character	Tony Cook	2014-05-28	1	-0/+1
\|
*	Revert "[perl #79908] Stop sub inlining from breaking closures"	Ævar Arnfjörð Bjarmason	2014-05-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit 137da2b05b4b7628115049f343163bdaf2c30dbb. See the "How about having a recommended way to add constant subs dynamically?" thread on perl5-porters. Specifically, while it sucks that we have this bug, it's been documented to work this way since 5.003 in "Constant Functions" in perlsub: "If the result after optimization and constant folding is either a constant or a lexically-scoped scalar which has no other references, then it will be used in place of function calls made without C<&>" -- http://perldoc.perl.org/perlsub.html#Constant-Functions Since we've had this documented bug for a long time we should introduce this fix in a deprecation cycle rather than silently slowing down code that assumes it's going to be optimized by constant folding. I didn't revert the tests in t/op/sub.t, but turned them into TODO tests instead. Conflicts: t/op/sub.t
*	[perl #121771] Revert the new warning for ++ on non- /\A[a-zA-Z]+[0-9]*\z/	Tony Cook	2014-05-07	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This failed as in it was producing: Argument "123abc" treated as 0 in increment (++) at -e line 1. when the user incremented that value (which is a lie). This reverts commits 8140a7a801e37d147db0e5a8d89551d9d77666e0 and 2cd5095e471e1d84dc9e0b79900ebfd66aabc909. I expect to revert this commit, and add fixes, after 5.20 is released. Conflicts: pod/perldiag.pod
*	Split Perl_do_openn() into Perl_do_open_raw() and Perl_do_open6().	Nicholas Clark	2014-03-19	1	-0/+5
\| \| \| \| \| \| \|	Perl_do_open_raw() handles the as_raw part of Perl_do_openn(). Perl_do_open6() handles the !as_raw part of Perl_do_openn(). do_open6() isn't a great name, but I can't see an obvious concise name that covers 2 arg open, 3 arg open, piped open, implicit fork, and layers.
*	Extract the cleanup code of Perl_do_openn() into S_openn_cleanup().	Nicholas Clark	2014-03-19	1	-0/+5
\| \| \| \| \| \|	A 12 parameter function is extremely ugly (as demonstrated by the need to add macros for it to perl.h), but it's private, and it will permit the two-headed public interface of Perl_do_openn() to be simplified.
*	Extract the setup code of Perl_do_openn() into S_openn_setup().	Nicholas Clark	2014-03-19	1	-0/+5
\|
*	Split out part of hv_auxinit() so it can be reused	Yves Orton	2014-03-18	1	-0/+1
\| \| \| \| \| \|	Changes nothing except that it introduces hv_auxinit_internal() which does part of the job of hv_auxinit(), so that we can call it from somewhere else in the next commit.
*	rpeep(): remove trailing OP_NULLs etc	David Mitchell	2014-03-16	1	-1/+1
\| \| \| \| \| \| \| \|	Perl_rpeep() elides OP_NULLs etc in the middle of an op_next chain, but not at the start or end. Doing it at the start is hard (and not addressed here); doing it at the end is trivial, and it just looks like a mistake in the original code (there since 1994) that was (incorrectly) worried about following through a null pointer.
*	regcomp.c: Use minimal struct formal parameter	Karl Williamson	2014-03-04	1	-1/+1
\| \| \| \| \| \| \|	The static function get_ANYOF_cp_list_for_ssc() takes a struct formal parameter that is a superset of what it actually uses. The calls to it have to cast to that superset. By setting the parameter to the smallest structure it uses, we simplify things.
*	Optimization: Remove needless list/pushmark pairs from the OP execution	Steffen Mueller	2014-02-26	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is an optimization for OP trees that involve list OPs in list context. In list context, the list OP's first child, a pushmark, will do what its name claims and push a mark to the mark stack, indicating the start of a list of parameters to another OP. Then the list's other child OPs will do their stack pushing. Finally, the list OP will be executed and do nothing but undo what the pushmark has done. This is because the main effect of the list OP only really kicks in if it's not in array context (actually, it should probably only kick in if it's in scalar context, but I don't know of any valid examples of list OPs in void contexts). This optimization is quite a measurable speed-up for array or hash slicing and some other situations. Another (contrived) example is that (1,2,(3,4)) now actually is the same, performance-wise, as (1,2,3,4), albeit that's rarely relevant. The price to pay for this is a slightly convoluted (by standards other than the perl core) bit of optimization logic that has to do minor look-ahead on certain OPs in the peephole optimizer. A number of tests failed after the first attack on this problem. The failures were in two categories: a) Tests that are sensitive to details of the OP tree structure and did verbatim text comparisons of B::Concise output (ouch). These are just patched according to the new red in this commit. b) Tests that validly failed because certain conditions in op.c were expecting OP_LISTs where there are now OP_NULLs (with op_targ=OP_LIST). For these, the respective conditions in op.c were adjusted. The change includes modifying B::Deparse to handle the new OP tree structure in the face of nulled OP_LISTs.
*	Improve how regprop dumps REF-like nodes during execution	Yves Orton	2014-02-24	1	-1/+1
\| \| \| \| \|	We pass in the regmatch_info struct, which allows us to dump what a given REF is going to match.
*	regcomp.c: Don't read uninitialized data	Karl Williamson	2014-02-19	1	-1/+1
\| \| \| \| \| \| \| \|	I keep forgetting that the OP of a regnode is not defined in Pass 1 of the regex compiler. This is likely the cause of inconsistent results in lib/locale.t, as valgrind shows there to be a read of uninitialized data before this patch, and the result is randomly tainting when there shouldn't be, consistent with the test failures.
*	regcomp.c: Remove no longer used function	Karl Williamson	2014-02-19	1	-1/+0
\| \| \| \|	I don't think this function will need to be used again.
*	regcomp.c: Fix more alignment problems	Karl Williamson	2014-02-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I believe this will fix the remaining alignment problems recently being shown on gcc on HP-UX. It works on the procura machine. regnodes should not have stricter alignment than required by U32, for reasons given in the comments this commit adds to the beginning of regcomp.h. Commit 31f05a37 added a new ANYOF regnode struct with a pointer field. This requires stricter alignment on some 64-bit platforms, and hence doesn't work on those platforms. This commit removes that regnode struct type, and instead stores the pointer it used via a more indirect, but already existing mechanism that stores other data. The function that returns that other data is enlarged to return this new field as well. It now needs to be called from regcomp.c, so the previous commit had renamed and made it accessible from there. The "public" function that wraps this one is unchanged. (I put "public" in quotes here, because I don't think anyone outside core is or should be using it, but since it has been publicly available for a long time, I'm treating the API as unchangeable. regcomp.c called this public function before this commit, but needs the additional data returned by the inner one.)
*	regexec.c: Rename function, add parameter, make non-static	Karl Williamson	2014-02-19	1	-3/+7
\| \| \| \| \| \|	This is in preparation for a future commit where the function does more things so its current name would be misleading. It will need to be callable from regcomp.c as well.
*	Convert more EXACTFish nodes to EXACT when possible	Karl Williamson	2014-02-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Under /i matching, many characters match only themselves, such as punctuation. If a node contains only such characters it can be an EXACT node. The optimizer gets better hints when dealing with EXACT nodes than ones with folding. This changes the alloc_maybe_populate() function to look for possibilities of non-folding input.
*	regcomp.c: Fix some alignment problems	Karl Williamson	2014-02-17	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The bracketed character class (e.g. /[abc]/) in regular expression patterns is implemented as an ANYOF regnode. There are several different structs used for these, each a superset of the next smaller size, with extra fields tacked on to its end. Bits in the part common to all of them are set to indicate which size this particular instance is. Several functions in regcomp.c take the largest of these as a formal parameter, even though a smaller one may actually be passed. This avoids the need to have casts to access the optional fields, but the code needs to be careful to check the common part bits before trying to access a portion that may not actually be present. This practice dates to at least Perl v5.6.2. It turns out that there is a further problem with this if the tacked-on fields require a stricter alignment than the common fields. The code in the functions may assume that the actual parameter has suitable alignment, which may not be the case. Some months ago I added some extra optional pointer fields, which have stricter alignment requirements on 64-bit machines than the common portion, but no apparent problems ensued. Then, I changed things slightly, so that the gcc compiler on HP machines found an optimization possibility whose use required the proper alignment, which wasn't present, and bus errors started happening there. Tony Cook diagnosed the problem. A summary of his work can be found at http://markmail.org/message/hee5zyah7rb62c72 This commit changes the formal parameter to the smallest ANYOF struct, and uses casts to access the optional portions. I don't know how common the coding style formerly used in regcomp.c is, but it is dangerous and can lead to unrelated changes causing errors.
This commit should enable gcc builds to complete on the HP gcc smokers (previously miniperl built, but crashed in building the rest of perl), but we're not sure because unrelated header issues on the gcc on the machine that we have access to prevent blead from fully compiling there. There remain alignment bugs which will cause the tests to fail there, as the appended pointer field needs to have strict alignment on that platform, but when the regnodes are allocated alignment isn't done. I am working on fixing those.
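The pattern this commit moves to (smallest struct as the formal parameter, flag check, then a cast to reach optional fields) can be sketched in C as below. The types `node_base` and `node_ext`, the flag `NODE_HAS_EXTRA` and the function `node_extra_or_zero` are hypothetical and much simpler than perl's ANYOF regnodes. The optional field here is deliberately a 32-bit value so that it needs no stricter alignment than the common part, which is the constraint the message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_HAS_EXTRA 0x01u  /* hypothetical "larger variant" flag */

/* Common part shared by every variant. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t pad;
    uint32_t bitmap;
} node_base;

/* Larger variant: the common part plus an optional trailing field. */
typedef struct {
    node_base base;
    uint32_t  extra;
} node_ext;

/* Takes the SMALLEST struct as its parameter. The optional field is
 * reached through a cast, and only after the flag in the common part
 * confirms that the larger variant was really passed. */
static uint32_t node_extra_or_zero(const node_base *n)
{
    assert(n != NULL);
    if (!(n->flags & NODE_HAS_EXTRA))
        return 0;
    return ((const node_ext *)n)->extra;
}
```

Because the function's declared parameter type is the smallest struct, the compiler cannot assume the stricter alignment or the larger size of the extended variant when optimizing the common-field accesses.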
*	Emulate POSIX locale setting on Windows	Karl Williamson	2014-02-15	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	Locale initialization and setting on Windows haven't been as described in perllocale for setting locales to "". This is because that tells Windows to use the system default locale, as set through the Control Panel, but on POSIX systems, it means to look at various environment variables. This commit creates a wrapper for setlocale, used only on Windows, that looks for the appropriate environment variables when called with a "" input locale. If none are found, it continues to use the system default locale.
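A rough C sketch of the lookup such a wrapper might perform is shown below. It is not the code from this commit: `resolve_locale`, `env_lookup` and `example_env` are hypothetical names, and the precedence used (LC_ALL, then the category's own variable, then LANG) is the usual POSIX order, assumed here rather than taken from the commit. The environment lookup is passed in as a function pointer so that the sketch can be exercised without touching the real environment; real code would pass a thin wrapper around getenv().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef const char *(*env_lookup)(const char *name);

/* Decide which locale name to hand to the real setlocale().
 * category_var is the variable name for the category, e.g. "LC_CTYPE". */
static const char *resolve_locale(const char *category_var,
                                  const char *requested,
                                  env_lookup lookup)
{
    const char *names[3];
    size_t i;

    assert(category_var != NULL && requested != NULL && lookup != NULL);
    if (requested[0] != '\0')
        return requested;          /* explicit name: pass it through */

    names[0] = "LC_ALL";
    names[1] = category_var;
    names[2] = "LANG";
    for (i = 0; i < 3; i++) {
        const char *v = lookup(names[i]);
        if (v != NULL && v[0] != '\0')
            return v;
    }
    return "";                     /* nothing set: system default */
}

/* Stand-in environment used only for demonstration. */
static const char *example_env(const char *name)
{
    if (strcmp(name, "LC_CTYPE") == 0)
        return "de_DE";
    if (strcmp(name, "LANG") == 0)
        return "en_US";
    return NULL;
}
```

With the stand-in environment, a request for "" in the LC_CTYPE category resolves to "de_DE", any other category falls through to LANG, and an explicit name such as "C" is returned unchanged.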
*	re_intuit_start(): bias last* vars; revive reghop4	David Mitchell	2014-02-07	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the "just matched float substr, now match fixed substr" branch, initially add an extra prog->anchored_offset to the last and last2 vars; since a lot of the later calculations involve adding anchored_offset, doing this early to the last* vars means less work in some cases. In particular, last is calculated from s by a single HOP4(s, prog->anchored_offset-start_shift,...) rather than two separate HOP3(s, -start_shift,...); HOP3(..., prog->anchored_offset,...); which may mostly cancel each other out. Similarly with last2. Later, we can skip adding prog->anchored_offset to last1, since its antecedents already have the bias added. In the case of failure, calculating a new start position involves an extra HOP to s, but removes a HOP from other_last, so the two cancel out. To make this work, I revived the reghop4() function which had been commented out, and added a HOP4c() wrapper macro. This is like HOP3c(), but allows you to specify both lower and upper limits. Useful when you don't know the sign of the offset in advance. (Yves had earlier added this function, but had commented it out until such time as it was actually used.) I also added some extra comments to this block and removed the comment about it being maybe broken under utf8, since I'm auditing the code for utf8-safeness.
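To illustrate what "specify both lower and upper limits" means for a hop function, here is a small C sketch. It is not perl's reghop4(): the name `hop4` and its signature are hypothetical, and it steps over characters by skipping UTF-8 continuation bytes, which assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>

static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Move off characters from s (forward if positive, backward if
 * negative), never going below llim or past rlim. Useful when the
 * sign of the offset is not known in advance. */
static const unsigned char *hop4(const unsigned char *s, ptrdiff_t off,
                                 const unsigned char *llim,
                                 const unsigned char *rlim)
{
    assert(llim <= s && s <= rlim);
    if (off >= 0) {
        while (off-- > 0 && s < rlim) {
            s++;
            while (s < rlim && is_continuation(*s))
                s++;
        }
    } else {
        while (off++ < 0 && s > llim) {
            s--;
            while (s > llim && is_continuation(*s))
                s--;
        }
    }
    return s;
}
```

A single call with a combined offset such as `anchored_offset - start_shift` replaces two opposite hops, and both limits are needed because the combined offset may turn out positive or negative.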
*	merge basic zefram/purple_signatures into blead	Zefram	2014-02-06	1	-0/+1
\|\
\| *	Merge blead into zefram/purple_signatures	Zefram	2014-02-01	1	-2/+2
\| \|\
\| * \|	subroutine signatures	Zefram	2014-02-01	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Declarative syntax to unwrap argument list into lexical variables. "sub foo ($a,$b) {...}" checks number of arguments and puts the arguments into lexical variables. Signatures are not equivalent to the existing idiom of "sub foo { my($a,$b) = @_; ... }". Signatures are only available by enabling a non-default feature, and generate warnings about being experimental. The syntactic clash with prototypes is managed by disabling the short prototype syntax when signatures are enabled.
* \| \|	Forbid "\c{" and \c{non-ascii}	Karl Williamson	2014-02-05	1	-1/+1
\| \|/ \|/\| \| \| \| \| \| \|	These constructs have been deprecated since v5.14 with the intention of making them fatal in 5.18. This wasn't done then, and is being done now.
* \|	update embed.fnc now op_null and op_free have docs	David Mitchell	2014-01-29	1	-2/+2
\|/
*	regcomp.c: Change variable and flag bit names	Karl Williamson	2014-01-27	1	-1/+1
\| \| \| \| \|	The meaning of these was expanded two commits ago, so update the name to reflect this, to prevent future confusion
*	Work properly under UTF-8 LC_CTYPE locales	Karl Williamson	2014-01-27	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales. As a side effect it resolves [perl #56820]. The basics are easy, but there were a lot of details, and one troublesome edge case discussed below. What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale. More work had to be done for regular expressions. There are three cases. 1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work. 2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only characters above-Latin1 that match only themselves, that the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on. 
In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work. Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances. 3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list. The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. 
Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
*	Taint more operands with case changes	Karl Williamson	2014-01-27	1	-4/+8
\| \| \| \| \| \| \| \| \| \|	The documentation says that Perl taints certain operations when subject to locale rules, such as lc() and ucfirst(). Prior to this commit there were exceptions when the operand to these functions contained no characters whose case change actually varied depending on the locale, for example the empty string or above-Latin1 code points. Changing to conform to the documentation simplifies the core code, and yields more consistent results.
*	regcomp.c: Extract out code into a separate function	Karl Williamson	2014-01-22	1	-0/+1
\| \| \| \|	This is in preparation for it to be called from a 2nd place.