delta/perl.git - github.com: perl/perl5.git

	Commit message (Collapse)	Author	Age	Files	Lines
*	[perl #121771] Revert the new warning for ++ on non- /\A[a-zA-Z]+[0-9]*\z/	Tony Cook	2014-05-07	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This failed as in it was producing: Argument "123abc" treated as 0 in increment (++) at -e line 1. when the user incremented that value (which is a lie). This reverts commits 8140a7a801e37d147db0e5a8d89551d9d77666e0 and 2cd5095e471e1d84dc9e0b79900ebfd66aabc909. I expect to revert this commit, and add fixes, after 5.20 is released. Conflicts: pod/perldiag.pod
*	my_plvarsp nulling and PERL_GLOBAL_STRUCT_PRIVATE	David Mitchell	2014-04-24	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With PERL_GLOBAL_STRUCT_PRIVATE, all "global" vars are in a malloc()d structure pointed to by the static var my_plvarsp. At exit, this struct is freed and my_plvarsp is set to NULL. My previous commit c1181d2b skipped the free if PL_veto_cleanup is set (as it would be if other threads are still running for example), but still left my_plvarsp getting set to NULL. Thus other threads could still deref a null pointer if they accessed a "global" var just as the main thread was exiting. This commit makes the veto skip the NULLing in addition to the freeing. This commit is quite late into the code freeze, but it's a follow-up to the earlier attempt to get smokes not to fail, and all the affected code is within #ifdef PERL_GLOBAL_STRUCT_PRIVATE, so it shouldn't affect mainstream builds at all. (Famous last words.)
*	import experimental.pm	Ricardo Signes	2014-04-15	1	-0/+1
\|
*	PerlIO.pm: Make pod :utf8 caution more prominent	Karl Williamson	2014-04-13	1	-4/+4
\| \| \| \| \| \|	The :utf8 layer continues to have security issues. This moves the warning about that to earlier where it's more likely to be seen, and makes it stand out more.
*	bump version on debugger	Ricardo Signes	2014-04-07	1	-1/+1
\|
*	properly reset ReadLine's knowledge of handles after pager	Hiroo Hayashi	2014-04-07	1	-0/+3
\| \| \| \|	[perl #121456]
*	utf8: add tests for behavior change in v5.15.6-407-gc710240, and more	Ævar Arnfjörð Bjarmason	2014-04-02	1	-0/+39
\|	In v5.15.6-407-gc710240 Father Chrysostomos patched utf8::decode() so it would call SvPV_force_nolen() on its argument. This meant that calling utf8::decode() with a non-blessed non-overloaded reference would now coerce the reference scalar to a string, i.e. before we'd do: $ ./perl -Ilib -MDevel::Peek -wle 'use strict; print $]; my $s = shift; my $s_ref = \$s; utf8::decode($s_ref); Dump $s_ref; print $$s_ref' ævar 5.019011 SV = IV(0x2579fd8) at 0x2579fe8 REFCNT = 1 FLAGS = (PADMY,ROK) RV = 0x25c33d8 SV = PV(0x257ab08) at 0x25c33d8 REFCNT = 2 FLAGS = (PADMY,POK,pPOK) PV = 0x25a1338 "\303\246var"\0 CUR = 5 LEN = 16 ævar But after calling SvPV_force_nolen(sv) we'd instead do: $ ./perl -Ilib -MDevel::Peek -wle 'use strict; print $]; my $s = shift; my $s_ref = \$s; utf8::decode($s_ref); Dump $s_ref; print $$s_ref' ævar 5.019011 SV = PVIV(0x140e4b8) at 0x13e7fe8 REFCNT = 1 FLAGS = (PADMY,POK,pPOK) IV = 0 PV = 0x140c578 "SCALAR(0x14313d8)"\0 CUR = 17 LEN = 24 Can't use string ("SCALAR(0x14313d8)") as a SCALAR ref while "strict refs" in use at -e line 1. I think this is arguably the right thing to do: we wouldn't actually utf8 decode the containing scalar, so this reveals bugs in code that passed references to utf8::decode(). What you want is to do this instead: $ ./perl -CO -Ilib -MDevel::Peek -wle 'use strict; print $]; my $s = shift; my $s_ref = \$s; utf8::decode($$s_ref); Dump $s_ref; print $$s_ref' ævar 5.019011 SV = IV(0x1aa8fd8) at 0x1aa8fe8 REFCNT = 1 FLAGS = (PADMY,ROK) RV = 0x1af23d8 SV = PV(0x1aa9b08) at 0x1af23d8 REFCNT = 2 FLAGS = (PADMY,POK,pPOK,UTF8) PV = 0x1ad0338 "\303\246var"\0 [UTF8 "\x{e6}var"] CUR = 5 LEN = 16 ævar However I think we should be more consistent here, e.g. we'll die when utf8::upgrade() gets passed a reference, but utf8::downgrade() just passes it through. 
I'll file a bug for that separately.
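
A minimal sketch of the usage the message above recommends: decode the referent (`$$ref`), not the reference. The variable names are illustrative only, and the byte string is assumed to be the UTF-8 encoding of the "ævar" argument used in the transcripts.

```perl
use strict;
use warnings;

# UTF-8 encoded bytes for "ævar" (the same bytes as "\303\246var" in the dumps).
my $bytes = "\xc3\xa6var";
my $ref   = \$bytes;

# Decode the scalar the reference points at, not the reference itself.
my $ok = utf8::decode($$ref);

printf "decoded=%d is_utf8=%d length=%d\n",
    ($ok ? 1 : 0), (utf8::is_utf8($$ref) ? 1 : 0), length($$ref);
```

Passing `$ref` instead of `$$ref` is the mistake the new behavior exposes: the reference is stringified and the underlying scalar is never decoded.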
*	PATCH: [perl #119499] "$!" with UTF-8 flag	Karl Williamson	2014-04-01	1	-4/+4
\| \| \| \| \| \| \| \|	This disables the code that sets the UTF-8 flag when "$!" is UTF-8. This is being done to get v5.20 out the door, with changes to follow in v5.21. See towards the end of the discussion of this ticket. Unfortunately this change will cause #112208 to no longer be fixed.
*	mktables: In-line defns for tables up to 3 ranges	Karl Williamson	2014-03-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	eb0925341cc65ce6ce57503ec0ab97cdad39dc98 caused the definitions for about 45% of the Unicode tables to be placed in-line in Heavy.pl instead of them having to be read-in from disk. This new commit extends that so that about 55% are in-lined, by in-lining tables which consist of up to 3 ranges. This is a no-brainer to do, as the memory usage does not increase by doing it, and disk accesses go down. I used the delta in the disk size of Heavy.pl as a proxy for the delta in the memory size that it uses, as what this commit does is to change various scalar strings in it. Doing this measurement indicates that this commit results in a slightly smaller Heavy.pl than what was there before eb092534. The amounts will vary between Unicode releases. I also checked for Unicode beta 7.0, and the sizes are again comparable, with a slightly larger Heavy.pl for the 3-range version there. For 4-, 5-, ... range tables, doing this results in slowly increasing Heavy.pl size (and hence more and more memory use), and that is something we may wish to look at in the future, trading memory for fewer files and less disk start-up cost. But for the imminent v5.20, doing it for 3-range tables doesn't cost us anything, and gains us fewer disk files and accesses.
*	mktables: Remove obsolete sort constraint	Karl Williamson	2014-03-18	1	-8/+6
\| \| \| \| \| \|	Zero-length tables are no longer expressed in terms of the Perl 'All' property, so the sort no longer has to make sure the latter is processed before the former.
*	mktables: Add comments, reorder a gen'd file	Karl Williamson	2014-03-18	1	-7/+12
\| \| \| \| \|	This adds some clarifying comments, and reorders Heavy.pl so the array referred to by two hashes occurs before both hashes in the output.
*	mktables: White-space only	Karl Williamson	2014-03-18	1	-9/+9
\| \| \| \| \|	This indents code to conform to the new block created in the previous commit
*	utf8_heavy.pl: Change data structure for in-lined definitions	Karl Williamson	2014-03-18	3	-17/+44
\|	This commit puts the in-lined definitions introduced by eb0925341cc65ce6ce57503ec0ab97cdad39dc98 into a separate array, where each element is a unique definition. This can result in slightly smaller current memory usage in utf8_heavy.pl, as the strings giving the file names now only appear once, and what replaces them in the hash values are strings of digit indices, which are shorter. But doing this allows us in a later commit to increase the number of ranges in the tables that get in-lined, without increasing memory usage. This commit also changes the code that generates the definitions to be more general, so that it becomes trivial to change things to generate in-line definitions for tables with more than one range.
*	mktables: Fix overlooked in-line table defns code	Karl Williamson	2014-03-18	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \|	Commit eb0925341cc65ce6ce57503ec0ab97cdad39dc98 introduced the idea of a pseudo-directory as a way to store table definitions in-line in Heavy.pl, but conform to the expectations of the code in regard to objects being files within directories. This kept the needed changes to a minimum. The code changed by the current commit was overlooked then as something that also needed to change, because there are no current instances of it needing to. But this could change with future Unicode versions, or as in the next few commits, in extending the in-line definitions.
*	regenerate warnings.pm	Ricardo Signes	2014-03-18	1	-14/+586
\|
*	replace links to perllexwarn with links to warnings	Ricardo Signes	2014-03-18	2	-4/+3
\| \| \| \|	or, sometimes, simply remove them
*	White-space only; properly indent newly formed blocks	Karl Williamson	2014-03-14	1	-9/+9
\| \| \| \| \|	The previous commit added braces forming blocks. This indents the contents of those blocks.
*	mktables: Inline short tables	Karl Williamson	2014-03-14	3	-23/+96
\|	mktables generates tables of Unicode properties. These are stored in files to be loaded on-demand. This is because the memory cost of having all of them loaded would be excessive, and many are rarely used. Hashes created in Heavy.pl, which is read in by utf8_heavy.pl, map the Unicode property name to the file which contains its definition. It turns out that nearly half of current Unicode properties are just a single consecutive range of code points, and their definitions are representable almost as compactly as the name of the files that contain them. This commit changes mktables so that the tables for single-range properties are not written out to disk, but instead a special syntax is used in Heavy.pl to indicate this and what their definitions are. This does not increase the memory usage of Heavy.pl appreciably, as the definitions replace the file names that are already there, but it lowers the number of files generated by mktables from 908 (in Unicode 6.3) to 507. These files are probably each a disk block, so the disk savings is not large. But it means that reading in any of these properties is much faster, as once utf8_heavy gets loaded, no further disk access is needed to get any of these properties. Most of these properties are obscure, but not all. The Line and Paragraph separators, for example, are quite commonly used. Further, utf8_heavy.pl caches the files it has read in into hashes. This is not necessary for these, as they are already in memory, so the total memory usage goes down if a program uses any of these, but again, since these are small, that amount is not large. The major gain is not having to read these files from disk at run time. Tables that match no code points at all are also represented using this mechanism. Previously, they were expressed as the complements of \p{All}, which matches everything possible.
*	lib/locale.t: Update $variable name	Karl Williamson	2014-03-13	1	-4/+4
\| \| \| \| \| \|	As of commit b057411ddb1a3d8b6ab062d667c8e39f80cd7343, the meaning of the variable is extended to beyond just being about 'folding', so change the name to correspond.
*	PATCH: [perl #121340] lib/locale.t noisy+complaining but passes on Win32	Karl Williamson	2014-03-13	1	-4/+4
\| \| \| \| \|	It turns out that these messages were not printed as one would expect under TAP, but were output using warn().
*	lib/locale.t: Fix broken test	Karl Williamson	2014-03-13	1	-10/+45
\| \| \| \| \| \| \| \| \|	The test that [:digit:] is a subset of [:xdigit:] failed in locales where [:digit:] matches 2 blocks of 10 digits, but the second block isn't considered part of [:xdigit:]. This happens in Thai on Windows. The POSIX standard http://pubs.opengroup.org/onlinepubs/9699919799/ does not use very clear language, but I'm taking it as meaning it is ok for this to happen, so this revises the test to accept it.
*	[perl #121362] overload optimisation added a SEGV	David Mitchell	2014-03-04	1	-1/+30
\| \| \| \| \| \| \| \| \| \| \|	My recent commit 3d147ac29d12abdb to "speed up (non)overloaded derefs" introduced a potential SEGV. In Perl_Gv_AMupdate(), the 'aux' variable is set to HvAUX(hv). My patch used the value of the variable later on in the function, but it turns out that by then, S_hsplit() may have been called, and thus HvARRAY (and HvAUX()) may have been reallocated. Issue first spotted by Andreas' awesome BBC service, and diagnosed by Nicholas Clark.
*	perluniprops: Show property name without braces	Karl Williamson	2014-03-01	1	-0/+5
\|	Properties with single letter names may be expressed with and without the braces; \p{L} and \pL are synonymous. This commit makes both forms be in perluniprops, so someone who doesn't know the detailed rules can search for either to see what it is. This was suggested by Zsbán Ambrus.
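
As a small illustration of the synonymy described above, the following sketch checks that `\p{L}` and `\pL` agree on a handful of sample characters. The sample list and variable names are hypothetical, chosen only for the demonstration.

```perl
use strict;
use warnings;

# Sample characters: two letters, a digit, an underscore, and U+00E9.
my @chars = ('a', 'Z', '7', '_', "\x{e9}");

# Collect any character on which the two spellings disagree.
my @disagree = grep { ($_ =~ /\p{L}/) xor ($_ =~ /\pL/) } @chars;

print "disagreements: ", scalar(@disagree), "\n";
```

The braceless form is only available for single-letter property names; longer names such as `\p{Lu}` always need the braces.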
*	Unicode/UCD.t: Fix broken test	Karl Williamson	2014-03-01	1	-6/+2
\| \| \| \| \| \| \| \|	The test file special cases certain properties by name. However, it turns out that a Unihan property that isn't normally compiled by Perl also should be included. And all these properties share the same format given in their files. So, instead of using the property names, use that format; this leads to code which is general, and simpler at the same time.
*	mktables: Allow Unicode Unihan files to compile	Karl Williamson	2014-03-01	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Perl normally doesn't include the Unicode Unihan files, but someone is free to recompile Perl with these. However, starting with commit 9e65c3f47e483ee7e33b5d748a06f4addd830d60, mktables checks for the version number on the input files and refuses to compile if incorrect. (This is to catch Perl trying to compile from a DB with inconsistent files; I believe Perl used to be shipped with these synchronization errors.) However, the Unihan files in Unicode 6.3 do not have the same syntax as the rest of the files, so since that commit Perl refuses to compile Unihan. The files are being updated in 7.0 to use the same syntax as the rest, so rather than hard-code the current syntax as an exception into mktables, this just skips checking these files until 7.0.
*	make OP_AELEMFAST work with negative indices	David Mitchell	2014-02-28	2	-2/+16
\| \| \| \| \| \| \| \| \| \|	Use aelemfast for literal index array access where the index is in the range -128..127, rather than 0..255. You'd expect something like $a[-1] or $a[-2] to be a lot more common than $a[100] say. In fact a quick CPAN grep shows 66 distributions matching /\$\w+\[\d{3,}\]/, but "at least" 1000 matching /\$\w+\[\-\d\]/. And most of the former appear to be table initialisations.
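
The two patterns quoted in the message can be tried on a few strings to see what each one counts. This is only a sketch: the snippets below are hypothetical stand-ins for CPAN source, and the optimization itself (which op is chosen for a literal index) is not observable from this code.

```perl
use strict;
use warnings;

# The two patterns quoted in the commit message.
my $three_digit = qr/\$\w+\[\d{3,}\]/;   # e.g. $a[100]
my $neg_single  = qr/\$\w+\[\-\d\]/;     # e.g. $a[-1]

# Hypothetical source snippets standing in for CPAN code.
my @snippets = ('$a[-1]', '$list[-2]', '$tbl[100]', '$x[5]');

my $neg_hits = grep { $_ =~ $neg_single }  @snippets;
my $big_hits = grep { $_ =~ $three_digit } @snippets;

# Literal negative indices count back from the end of the array.
my @a = (10, 20, 30);
my ($last, $second_last) = ($a[-1], $a[-2]);

print "negative=$neg_hits three_digit=$big_hits last=$last\n";
```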
*	Optimization: Remove needless list/pushmark pairs from the OP execution	Steffen Mueller	2014-02-26	1	-1/+11
\|	This is an optimization for OP trees that involve list OPs in list context. In list context, the list OP's first child, a pushmark, will do what its name claims and push a mark to the mark stack, indicating the start of a list of parameters to another OP. Then the list's other child OPs will do their stack pushing. Finally, the list OP will be executed and do nothing but undo what the pushmark has done. This is because the main effect of the list OP only really kicks in if it's not in array context (actually, it should probably only kick in if it's in scalar context, but I don't know of any valid examples of list OPs in void contexts). This optimization is quite a measurable speed-up for array or hash slicing and some other situations. Another (contrived) example is that (1,2,(3,4)) now actually is the same, performance-wise as (1,2,3,4), albeit that's rarely relevant. The price to pay for this is a slightly convoluted (by standards other than the perl core) bit of optimization logic that has to do minor look-ahead on certain OPs in the peephole optimizer. A number of tests failed after the first attack on this problem. The failures were in two categories: a) Tests that are sensitive to details of the OP tree structure and did verbatim text comparisons of B::Concise output (ouch). These are just patched according to the new red in this commit. b) Tests that validly failed because certain conditions in op.c were expecting OP_LISTs where there are now OP_NULLs (with op_targ=OP_LIST). For these, the respective conditions in op.c were adjusted. The change includes modifying B::Deparse to handle the new OP tree structure in the face of nulled OP_LISTs.
*	lib/locale.t: Make more tests not fail unless is bad for enough locales	Karl Williamson	2014-02-24	1	-0/+10
\|	locale.t has some tests that fail if even one locale fails; and it has some tests where failure doesn't happen unless a sufficient percentage of locales have the same problem. The first set should be for tests whose failure indicates a basic problem in locale handling; and the second set should be for tests where it could be that just a locale definition is bad. Prior to this patch, tests dealing with radix problems were considered in the first category, but in fact it's possible that just the locale definition for the radix is wrong. This is what happened for some older Darwin versions for their Basque locales, which caused locale.t to show failures, whereas it was just these locales that were bad, and the generic handling was ok, or good enough. (The actual failures had the radix be the two character string: apostrophe followed by a blank. It would be a lot of work to make Perl deal with having a quote character also mean a decimal point, and that work isn't worth it, especially as this was a locale definition error, and we don't know of any locale in the world where an apostrophe is legitimately a radix character.) For this commit, I looked through the tests, and I added the tests where it seemed that the problem could just be a bad locale definition to the list of such tests. Note that failures here could mean an internal Perl error, but in that case, it should affect many more locales, so will show up anyway as the failure rate should exceed the acceptable one.
*	lib/locale.t: Change an array to a hash	Karl Williamson	2014-02-24	1	-4/+4
\| \| \| \| \| \| \|	This is more naturally a hash in that it is a list of numbers, not necessarily consecutive, and each time through the loop the same number was getting pushed, so had multiple entries for each by the time it was finished.
*	Missing version bump for Deparse	Steffen Mueller	2014-02-24	1	-1/+1
\| \| \| \| \|	Not a good run, I am having. But also not anything tangible that our silly tests caught me on. :(
*	B::Deparse: Padrange deparse fix	Steffen Mueller	2014-02-24	1	-6/+6
\| \| \| \| \| \| \| \| \|	The PADRANGE support fakes up a PUSHMARK OP but until this commit, it did so incompletely since it never overrode the OP type (that was still an OP_PADRANGE). This addresses that. On top of it, there's two minor changes that switch from "eq" to "==" for comparing numeric OP types.
*	Change 'semantics' to 'rules'	Karl Williamson	2014-02-20	2	-5/+5
\| \| \| \| \| \|	The term 'semantics' in documentation when applied to character sets is changed to 'rules' as being a shorter less-jargony synonym in this case. This was discussed several releases ago, but I didn't get around to it.
*	lib/locale.t: Remove tests that need UTF-8 locale	Karl Williamson	2014-02-19	1	-9/+0
\| \| \| \| \| \| \| \| \| \|	These tests should not be here because they will only match under a UTF-8 locale, which happens to be the case on the machine I developed them on, but not necessarily always true, and so they are failing. Given the deadline is already past, I'm just removing them for now, and will re-add them later in another place in the file where we know we are using a UTF-8 locale.
*	Make taint checking regex compile time instead of runtime	Karl Williamson	2014-02-19	1	-0/+77
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	See discussion at https://rt.perl.org/Ticket/Display.html?id=120675 There are several unresolved items in this discussion, but we did agree that tainting should be dependent only on the regex pattern, and not the particular input string being matched against: "The bottom line is we are moving to the policy that tainting is based on the operation being in locale, without regard to the particular operand's contents passed this time to the operation. This means simpler core code and more consistent tainting results. And it lessens the likelihood that there are paths in the core that should taint but don't" This commit does the minimal work to change regex pattern matching to determine tainting at pattern compilation time. Simply put, if a pattern contains a regnode whose match/not match depends on the run-time locale, any attempt to match against that pattern will taint, regardless of the actual target string or runtime locale in effect. Given this change, there are optimizations that can be made to avoid runtime work, but these are deferred until later. Note that just because a regular expression is compiled under locale doesn't mean that the generated pattern will be tainted. It depends on the actual pattern. For example, the pattern /(.)/ doesn't taint because it will match exactly one character of the input, regardless of locale settings.
*	lib/locale.t: Add some test names	Karl Williamson	2014-02-19	1	-92/+92
\|
*	lib/locale.t: Untaint before checking if next thing taints	Karl Williamson	2014-02-19	1	-0/+11
\|	The tests weren't testing what they purported to, as we should be sure to start with untainted values to see if the operation taints.
*	Correct number of tests in plan.	James E Keenan	2014-02-19	1	-1/+1
\|
*	restore $PERL_OLD_VERSION to English.pm	David Golden	2014-02-18	2	-2/+4
\|	In the dark ages, when $^V replaced $] for $PERL_VERSION, $PERL_OLD_VERSION was added as a comment in the list of deprecated variables. Since $] is not deprecated, this commit restores it.
*	[perl #121081] workaround different output on VMS	Tony Cook	2014-02-19	1	-5/+6
\| \| \| \| \|	VMS is a special snowflake, deal with the slightly different debugger output it produces.
*	Skip locale test on OpenBSD, MirBSD and Bitrig too	Chris 'BinGOs' Williams	2014-02-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	From the original ticket #115808 the following should produce "Use of uninitialized value in print at -e line 1." $ perl -wle 'use POSIX; print length setlocale POSIX::LC_ALL, "mtfnpy"' 16 So skip this test on OpenBSD, MirBSD and Bitrig
*	Expand tabs in diagnostics.pm	Father Chrysostomos	2014-02-08	1	-0/+2
\|	Otherwise pod like this: The second situation is caused by an eval accessing a lexical subroutine that has gone out of scope, for example, sub f { my sub a {...} sub { eval '\&a' } } f()->(); is turned into this: The second situation is caused by an eval accessing a variable that has gone out of scope, for example, sub f { my $a; sub { eval '$a' } } f()->(); instead of this: The second situation is caused by an eval accessing a variable that has gone out of scope, for example, sub f { my $a; sub { eval '$a' } } f()->(); I don’t know how to test this without literally copying and pasting parts of diagnostics.pm into diagnostics.t. But I have tested it manually and it works.
*	diagnostics.pm: Eliminate $WHOAMI	Father Chrysostomos	2014-02-08	1	-5/+6
\| \| \| \| \| \| \| \|	This variable only held the package name. __PACKAGE__ is faster, as it allows constant folding. diagnostics.pm just happens to be older than __PACKAGE__, which was introduced as recently as 1997 (68dc074516).
*	Increase $diagnostics::VERSION to 1.34	Father Chrysostomos	2014-02-08	1	-1/+1
\|
*	merge basic zefram/purple_signatures into blead	Zefram	2014-02-06	2	-13/+37
\|\
\| *	Merge blead into zefram/purple_signatures	Zefram	2014-02-01	1	-8/+18
\| \|\
\| * \|	subroutine signatures	Zefram	2014-02-01	2	-13/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Declarative syntax to unwrap argument list into lexical variables. "sub foo ($a,$b) {...}" checks number of arguments and puts the arguments into lexical variables. Signatures are not equivalent to the existing idiom of "sub foo { my($a,$b) = @_; ... }". Signatures are only available by enabling a non-default feature, and generate warnings about being experimental. The syntactic clash with prototypes is managed by disabling the short prototype syntax when signatures are enabled.
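
A minimal sketch of the feature as described above, assuming a perl of 5.20 or later. The sub name `add` and its parameters are hypothetical; the `no warnings` line silences the experimental warning the message mentions.

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# The signature unpacks @_ into lexicals and checks the argument count.
sub add ($x, $y) {
    return $x + $y;
}

my $sum = add(2, 3);

# Calling with too few arguments dies at run time; capture the error.
my $err = eval { add(1); 1 } ? '' : $@;

print "sum=$sum\n";
print "arity check fired\n" if $err =~ /Too few arguments/;
```

The argument-count check is the visible difference from the `my($a,$b) = @_;` idiom, which silently accepts any number of arguments.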
* \| \|	Don't test locales that are invalid for needed categories	Karl Williamson	2014-02-04	1	-4/+3
\| \|/ \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When looking for locales to test, skip ones which aren't defined in every locale category we care about. This was motivated by a NetBSD machine which has a Pig Latin locale, but it is defined only for LC_MESSAGES. This necessitated adding parameters to pass the desired locale(s), and renaming a test function to indicate the current category it is valid for.
* \|	lib/locale.t: Better debug information	Karl Williamson	2014-01-29	1	-8/+18
\|/ \| \| \|	This adds a couple of lines of information, and sorts some other output
*	mktables: Refer to an actual commit number	Karl Williamson	2014-01-28	1	-2/+3
\|
*	Work properly under UTF-8 LC_CTYPE locales	Karl Williamson	2014-01-27	3	-28/+73
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales. As a side effect it resolves [perl #56820]. The basics are easy, but there were a lot of details, and one troublesome edge case discussed below. What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale. More work had to be done for regular expressions. There are three cases. 1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work. 2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only characters above-Latin1 that match only themselves, that the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on. 
In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work. Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances. 3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list. The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. 
Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
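
The fold relationship behind this edge case can be seen under plain Unicode rules, without `use locale`. The sketch below does not exercise the locale code path the commit changes; it only shows the three characters matching one another case-insensitively, which is the behavior a UTF-8 locale is expected to reproduce. Variable names are illustrative.

```perl
use strict;
use warnings;

my $micro  = "\x{B5}";    # MICRO SIGN
my $cap_mu = "\x{39C}";   # GREEK CAPITAL LETTER MU

# Each pattern mentions a code point above 255, so Unicode rules apply.
my $micro_vs_cap = $micro  =~ /\A\x{39C}\z/i ? 1 : 0;  # MICRO SIGN vs CAP MU
my $cap_vs_mu    = $cap_mu =~ /\A\x{3BC}\z/i ? 1 : 0;  # CAP MU vs small mu
my $micro_vs_mu  = $micro  =~ /\A\x{3BC}\z/i ? 1 : 0;  # MICRO SIGN vs small mu

print "micro~CAP_MU=$micro_vs_cap CAP_MU~mu=$cap_vs_mu micro~mu=$micro_vs_mu\n";
```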