| Commit message (Collapse) | Author | Age | Files | Lines |
| |
The previous commit added braces forming blocks. This indents the
contents of those blocks.
| |
mktables generates tables of Unicode properties. These are stored in
files to be loaded on-demand. This is because the memory cost of having
all of them loaded would be excessive, and many are rarely used. Hashes
created in Heavy.pl, which is read in by utf8_heavy.pl, map the
Unicode property name to the file which contains its definition.
It turns out that nearly half of current Unicode properties are just a
single consecutive range of code points, and their definitions are
representable almost as compactly as the name of the files that contain
them.
This commit changes mktables so that the tables for single-range
properties are not written out to disk, but instead a special syntax is
used in Heavy.pl to indicate this and what their definitions are.
This does not increase the memory usage of Heavy.pl appreciably, as the
definitions replace the file names that are already there, but it lowers
the number of files generated by mktables from 908 (in Unicode 6.3) to
507. These files are probably each a disk block, so the disk savings is
not large. But it means that reading in any of these properties is much
faster, as once utf8_heavy gets loaded, no further disk access is needed
to get any of these properties. Most of these properties are obscure,
but not all. The Line and Paragraph separators, for example, are quite
commonly used.
Further, utf8_heavy.pl caches the files it has read in into hashes.
This is not necessary for these, as they are already in memory, so the
total memory usage goes down if a program uses any of these, but again,
since these are small, that amount is not large. The major gain is not
having to read these files from disk at run time.
Tables that match no code points at all are also represented using this
mechanism. Previously, they were expressed as the complements of
\p{All}, which matches everything possible.
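As a rough illustration only: the hash name, the keys, and the "range:"
notation below are hypothetical and are not the syntax mktables actually
writes into Heavy.pl. The sketch shows the general idea of a lookup table
whose values are either a file name or an inline single-range definition,
so that inline entries need no disk access.

```perl
use strict;
use warnings;

# Hypothetical table: a value is either a file name or, with a made-up
# "range:" prefix, an inline single-range definition (hex start-end).
my %property_to_source = (
    line_separator => 'range:2028-2028',
    some_big_table => 'lib/SomeBig.pl',   # placeholder file name
    matches_none   => 'range:',           # empty definition: no code points
);

# Return a list of [start, end] pairs for inline entries, or undef when
# the caller would have to read the named file from disk instead.
sub inline_ranges {
    my ($name) = @_;
    my $source = $property_to_source{$name};
    return undef unless defined $source && $source =~ /^range:(.*)$/;
    my $spec = $1;
    return [] if $spec eq '';
    my ($start, $end) = map { hex($_) } split /-/, $spec;
    return [ [ $start, $end ] ];
}
```

Here scalar(inline_ranges('some_big_table')) is undef, signalling the
file-backed path, while the other two entries resolve entirely in memory.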
| |
As of commit b057411ddb1a3d8b6ab062d667c8e39f80cd7343, the meaning of
the variable is extended beyond just being about 'folding', so change
the name to correspond.
| |
It turns out that these messages were not printed as one would expect
under TAP, but were output using warn().
| |
The test that [:digit:] is a subset of [:xdigit:] failed in locales
where [:digit:] matches 2 blocks of 10 digits, but the second block
isn't considered part of [:xdigit:]. This happens in Thai on Windows.
The POSIX standard http://pubs.opengroup.org/onlinepubs/9699919799/
does not use very clear language, but I'm taking it as meaning it is ok
for this to happen, so this revises the test to accept it.
| |
My recent commit 3d147ac29d12abdb to "speed up (non)overloaded derefs"
introduced a potential SEGV. In Perl_Gv_AMupdate(), the 'aux' variable is
set to HvAUX(hv). My patch used the value of the variable later on in the
function, but it turns out that by then, S_hsplit() may have been called,
and thus HvARRAY (and HvAUX()) may have been reallocated.
Issue first spotted by Andreas' awesome BBC service, and diagnosed by
Nicholas Clark.
| |
Properties with single letter names may be expressed with and without the
braces; \p{L} and \pL are synonymous. This commit makes both forms
appear in perluniprops, so someone who doesn't know the detailed rules can
search for either to see what it is.
This was suggested by Zsbán Ambrus.
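A short check of the equivalence described above; the sample inputs are
chosen only for illustration.

```perl
use strict;
use warnings;

# Both spellings name the General_Category "Letter" property.
my @inputs   = ('A', "\x{3B1}", '7', '-');
my @braced   = map { /\p{L}/ ? 1 : 0 } @inputs;
my @unbraced = map { /\pL/   ? 1 : 0 } @inputs;
```

Both arrays hold the same results: the two letters match and the digit
and the hyphen do not.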
| |
The test file special cases certain properties by name. However, it
turns out that a Unihan property that isn't normally compiled by Perl
also should be included. And all these properties share the same format
given in their files. So, instead of using the property names, use that
format; this leads to code which is general, and simpler at the same time.
| |
Perl normally doesn't include the Unicode Unihan files, but someone is
free to recompile Perl with these. However, starting with commit
9e65c3f47e483ee7e33b5d748a06f4addd830d60, mktables checks for the
version number on the input files and refuses to compile if incorrect.
(This is to catch Perl trying to compile from a DB with inconsistent
files; I believe Perl used to be shipped with these synchronization
errors.) However, the Unihan files in Unicode 6.3 do not have the same
syntax as the rest of the files, so since that commit Perl refuses to
compile Unihan.
The files are being updated in 7.0 to use the same syntax as the rest,
so rather than hard-code the current syntax as an exception into
mktables, this just skips checking these files until 7.0.
| |
Use aelemfast for literal index array access where the index is in the
range -128..127, rather than 0..255.
You'd expect something like $a[-1] or $a[-2] to be a lot more common than
$a[100] say. In fact a quick CPAN grep shows 66 distributions
matching /\$\w+\[\d{3,}\]/, but "at least" 1000 matching /\$\w+\[\-\d\]/.
And most of the former appear to be table initialisations.
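For illustration, the kind of code each case refers to, together with the
two patterns quoted above applied to sample source text. The variable
names and sample strings are made up; the aelemfast optimization itself is
internal and is not observable from Perl code like this.

```perl
use strict;
use warnings;

my @queue = (10, 20, 30, 40);

# Literal negative indexes count from the end of the array.
my $last        = $queue[-1];
my $second_last = $queue[-2];

# The two patterns from the CPAN grep, applied to sample source lines.
my $large_index_re    = qr/\$\w+\[\d{3,}\]/;
my $negative_index_re = qr/\$\w+\[\-\d\]/;

my $has_large    = '$table[100] = 1;'   =~ $large_index_re    ? 1 : 0;
my $has_negative = 'return $stack[-1];' =~ $negative_index_re ? 1 : 0;
```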
| |
This is an optimization for OP trees that involve list OPs in list
context. In list context, the list OP's first child, a pushmark, will do
what its name claims and push a mark to the mark stack, indicating the
start of a list of parameters to another OP. Then the list's other
child OPs will do their stack pushing. Finally, the list OP will be
executed and do nothing but undo what the pushmark has done. This is
because the main effect of the list OP only really kicks in if it's
not in list context (actually, it should probably only kick in if
it's in scalar context, but I don't know of any valid examples of
list OPs in void contexts).
This optimization is quite a measurable speed-up for array or hash
slicing and some other situations. Another (contrived) example is
that (1,2,(3,4)) now actually is the same, performance-wise as
(1,2,3,4), albeit that's rarely relevant.
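The Perl-level constructs mentioned above, shown for reference. This only
demonstrates that the results are identical; it does not measure the
speed-up, and the variable names are illustrative.

```perl
use strict;
use warnings;

# A nested list in list context flattens to the same elements.
my @nested = (1, 2, (3, 4));
my @flat   = (1, 2, 3, 4);

# Array and hash slices take a list of indexes or keys.
my @letters = ('a' .. 'f');
my @picked  = @letters[1, 3];

my %number_for = (one => 1, two => 2, three => 3);
my @values     = @number_for{qw(one three)};
```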
The price to pay for this is a slightly convoluted (by standards other
than the perl core) bit of optimization logic that has to do minor
look-ahead on certain OPs in the peephole optimizer.
A number of tests failed after the first attack on this problem. The
failures were in two categories:
a) Tests that are sensitive to details of the OP tree structure and did
verbatim text comparisons of B::Concise output (ouch). These are just
patched according to the new red in this commit.
b) Tests that validly failed because certain conditions in op.c were
expecting OP_LISTs where there are now OP_NULLs (with op_targ=OP_LIST).
For these, the respective conditions in op.c were adjusted.
The change includes modifying B::Deparse to handle the new OP tree
structure in the face of nulled OP_LISTs.
| |
locale.t has some tests that fail if even one locale fails; and it has some
tests where failure doesn't happen unless a sufficient percentage of
locales have the same problem.
The first set should be for tests whose failure indicates a basic
problem in locale handling; and the second set should be for tests where
it could be that just a locale definition is bad.
Prior to this patch, tests dealing with radix problems were considered
in the first category, but in fact it's possible that just the locale
definition for the radix is wrong. This is what happened for some older
Darwin versions for their Basque locales, which caused locale.t to show
failures, whereas it was just these locales that were bad, and the
generic handling was ok, or good enough. (The actual failures had the
radix be the two character string: apostrophe followed by a blank. It
would be a lot of work to make Perl deal with having a quote character
also mean a decimal point, and that work isn't worth it, especially as
this was a locale definition error, and we don't know of any locale in
the world where an apostrophe is legitimately a radix character.)
For this commit, I looked through the tests, and I added the tests where
it seemed that the problem could just be a bad locale definition to the
list of such tests. Note that failures here could mean an internal Perl
error, but in that case, it should affect many more locales, so will
show up anyway as the failure rate should exceed the acceptable one.
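A minimal sketch of the two categories, assuming a hypothetical helper
and a hypothetical threshold parameter. Neither the function name nor the
argument names are taken from locale.t.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a test counts as failed, given how
# many locales were tried and how many of them failed it.
sub test_counts_as_failed {
    my (%args) = @_;
    my ($failed, $total) = @args{qw(failed total)};
    return 0 if $total == 0 || $failed == 0;

    # Tests probing basic locale handling fail if even one locale fails.
    return 1 if $args{must_pass_everywhere};

    # Otherwise tolerate a limited share of bad locale definitions.
    my $percent_failed = 100 * $failed / $total;
    return $percent_failed > $args{acceptable_percent} ? 1 : 0;
}
```

With an assumed acceptable rate of 5 percent, two bad locales out of 100
would be tolerated, while 20 out of 100 would be reported as a failure.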
| |
This is more naturally a hash in that it is a list of numbers, not
necessarily consecutive, and each time through the loop the same number
was getting pushed, so had multiple entries for each by the time it was
finished.
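The difference can be seen in a small sketch. The numbers and variable
names here are invented and do not come from the changed code.

```perl
use strict;
use warnings;

my @seen_list;   # array version: accumulates duplicates
my %seen;        # hash version: one entry per distinct number

for my $pass (1 .. 3) {
    for my $number (7, 42, 7) {
        push @seen_list, $number;
        $seen{$number} = 1;
    }
}

my @unique = sort { $a <=> $b } keys %seen;
```

After the loops the array holds nine entries, while the hash holds only
the two distinct numbers.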
| |
Not a good run, I am having. But also not anything tangible
that our silly tests caught me on. :(
| |
The PADRANGE support fakes up a PUSHMARK OP but until this commit, it
did so incompletely since it never overrode the OP type (that was still
an OP_PADRANGE). This addresses that.
On top of it, there are two minor changes that switch from "eq" to "=="
for comparing numeric OP types.
| |
The term 'semantics' in documentation when applied to character sets is
changed to 'rules' as being a shorter less-jargony synonym in this case.
This was discussed several releases ago, but I didn't get around to it.
| |
These tests should not be here because they will only match under a
UTF-8 locale, which happens to be the case on the machine I developed
them on, but not necessarily always true, and so they are failing.
Given the deadline is already past, I'm just removing them for now, and
will re-add them later in another place in the file where we know we
are using a UTF-8 locale.
| |
See discussion at https://rt.perl.org/Ticket/Display.html?id=120675
There are several unresolved items in this discussion, but we did agree
that tainting should be dependent only on the regex pattern, and not the
particular input string being matched against:
"The bottom line is we are moving to the policy that tainting is based
on the operation being in locale, without regard to the particular
operand's contents passed this time to the operation. This means simpler
core code and more consistent tainting results. And it lessens the
likelihood that there are paths in the core that should taint but don't"
This commit does the minimal work to change regex pattern matching to
determine tainting at pattern compilation time. Simply put, if a
pattern contains a regnode whose match/not match depends on the run-time
locale, any attempt to match against that pattern will taint, regardless
of the actual target string or runtime locale in effect. Given this
change, there are optimizations that can be made to avoid runtime work,
but these are deferred until later.
Note that just because a regular expression is compiled under locale
doesn't mean that the generated pattern will be tainted. It depends on
the actual pattern. For example, the pattern /(.)/ doesn't taint
because it will match exactly one character of the input, regardless of
locale settings.
|
| |
| |
The tests weren't testing what they purported to, as we should be sure
to start with untainted values to see if the operation taints.
|
| |
| |
In the dark ages, when $^V replaced $] for $PERL_VERSION,
$PERL_OLD_VERSION was added as a comment in the list of deprecated
variables. Since $] is *not* deprecated, this commit restores it.
| |
VMS is a special snowflake; deal with the slightly different debugger
output it produces.
| |
From the original ticket #115808 the following should produce
"Use of uninitialized value in print at -e line 1."
$ perl -wle 'use POSIX; print length setlocale POSIX::LC_ALL, "mtfnpy"'
16
So skip this test on OpenBSD, MirBSD and Bitrig
| |
Otherwise pod like this:
The second situation is caused by an eval accessing a lexical subroutine
that has gone out of scope, for example,
sub f {
my sub a {...}
sub { eval '\&a' }
}
f()->();
is turned into this:
The second situation is caused by an eval accessing a variable that has
gone out of scope, for example,
sub f {
my $a;
sub { eval '$a' }
}
f()->();
instead of this:
The second situation is caused by an eval accessing a variable that has
gone out of scope, for example,
sub f {
my $a;
sub { eval '$a' }
}
f()->();
I don’t know how to test this without literally copying and pasting
parts of diagnostics.pm into diagnostics.t. But I have tested it
manually and it works.
| |
This variable only held the package name. __PACKAGE__ is faster,
as it allows constant folding.
diagnostics.pm just happens to be older than __PACKAGE__, which was
introduced as recently as 1997 (68dc074516).
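A sketch of the two styles, using a made-up package name rather than the
code in diagnostics.pm.

```perl
package My::Diag::Example;   # hypothetical package name
use strict;
use warnings;

# Old style: a variable that exists only to hold the package name.
my $package_var = 'My::Diag::Example';

# New style: __PACKAGE__ is a compile-time constant naming the current
# package, so no variable lookup is needed at run time.
sub name_from_variable { return $package_var }
sub name_from_constant { return __PACKAGE__ }

package main;
```

Both subroutines return the same string; only the second avoids keeping a
variable around for it.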
|
| |
|
|\ |
|
| |\ |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Declarative syntax to unwrap argument list into lexical variables.
"sub foo ($a,$b) {...}" checks number of arguments and puts the
arguments into lexical variables. Signatures are not equivalent to the
existing idiom of "sub foo { my($a,$b) = @_; ... }". Signatures are only
available by enabling a non-default feature, and generate warnings about
being experimental. The syntactic clash with prototypes is managed by
disabling the short prototype syntax when signatures are enabled.
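A small example of the feature next to the traditional idiom, assuming a
perl recent enough to provide the 'signatures' feature. The subroutine
names are illustrative.

```perl
use strict;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# With a signature, the number of arguments is checked on each call.
sub add_pair ($x, $y) {
    return $x + $y;
}

# Traditional idiom: no argument-count check is performed.
sub add_pair_classic {
    my ($x, $y) = @_;
    return $x + $y;
}

my $sum = add_pair(2, 3);

# Calling with too few arguments dies at run time; capture the error.
my $ok    = eval { add_pair(1); 1 };
my $error = $ok ? '' : $@;
```

This difference in argument checking is one reason the two forms are not
equivalent.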
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When looking for locales to test, skip ones which aren't defined in
every locale category we care about. This was motivated by a NetBSD
machine which has a Pig Latin locale, but it is defined only for
LC_MESSAGES.
This necessitated adding parameters to pass the desired locale(s), and
renaming a test function to indicate the current category it is valid
for.
|
|/
|
|
| |
This adds a couple of lines of information, and sorts some other output.
|
| |
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
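For readers unfamiliar with inversion lists, here is a minimal sketch of
the general data structure in Perl. It is not the C implementation used
by the regex engine, and the function name and sample list are chosen
only for illustration.

```perl
use strict;
use warnings;

# Inversion list for [A-Za-z]: each element starts a run of code points.
# Runs beginning at even indexes are in the set, runs beginning at odd
# indexes are not.
my @ascii_letters = (0x41, 0x5B, 0x61, 0x7B);

sub in_inversion_list {
    my ($list, $code_point) = @_;
    my ($low, $high) = (0, scalar @$list);

    # Binary search for the number of elements <= $code_point.
    while ($low < $high) {
        my $mid = int(($low + $high) / 2);
        if ($list->[$mid] <= $code_point) { $low  = $mid + 1 }
        else                              { $high = $mid }
    }

    # $low elements are <= the code point. An odd count means the last
    # such element sits at an even index, so the code point is in the set.
    return $low % 2;
}
```

Membership tests therefore cost a binary search over the run boundaries
rather than a scan over every code point in the class.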
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not
case-insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
|
| |
| |
It has been hanging or unnecessarily using & since commit d16269d835
caused spaces to be preserved in the prototype and stripped when
applied during sub call compilation. That commit did not update
B::Deparse accordingly.
| |
The documentation says that Perl taints certain operations when subject
to locale rules, such as lc() and ucfirst(). Prior to this commit
there were exceptions when the operand to these functions contained no
characters whose case change actually varied depending on the locale,
for example the empty string or above-Latin1 code points. Changing to
conform to the documentation simplifies the core code, and yields more
consistent results.
| |
Prior to this patch, this was in regen/mk_invlists.pl, but future
commits will want it to also be used by the header generated by
regen/regcharclass.pl, so use a common source so the logic doesn't have
to be duplicated.
| |
This trivial code to determine if a locale is a utf8 one or not is
currently used in just one place, but future commits will use it in
others, and will make it non-trivial, and non-obvious.
|
| |
| |
Instead of throwing an error, just go ahead and do the import.
This will tell Perl internally to use the current underlying locale,
which should be the C locale. Attempts to change the locale will fail.
This differs slightly from Brian Fraser's patch, in that his didn't
touch $^H, thus 'use locale' was a no-op. He has told me to apply this
one, which does affect $^H. The advantage here is that now programs
that are run on platforms with and without locales will behave
similarly, and should run identically if the locale is not switched from
the default.
| |
available
| |
As a result of some code meant to do the same thing being in two
different places, one got updated, and one didn't. So t/run/locale.t
was being skipped for Win32, even though the bug there it was avoiding
has been fixed in XP.
| |
Several different test files need to find the locales on the system, and
each currently has rolled its own methods to do that. This commit
creates a new file t/loc_tools.pl which is designed to be a common
place for these tools.
lib/locale.t did the most thorough job of finding locales, so
t/loc_tools.pl is built upon what it had, which is now deleted from
locale.t.
The code in t/loc_tools.pl was copied from lib/locale.t with white space
changes and changes to make this be a subroutine, and helper functions
renamed to begin with an underscore, and changing the hard-coded list to
be in a DATA section so it doesn't have to be actually used unless
necessary.
| |
locale.t has changed so if tests in some locales fail, it still passes,
provided that most locales work. Thus this code whose effect was to
know about some broken locales and SKIP them, is no longer needed.
|
| |
| |
Until now, the behavior of the statements
use warnings "FATAL";
use warnings "NONFATAL";
no warnings "FATAL";
was unspecified and inconsistent. This change causes them to be handled
with an implied "all" at the end of the import list.
Tony Cook: fix AUTHORS formatting
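A quick way to observe the new behaviour on a perl that includes this
change. The variable names are illustrative.

```perl
use strict;

my $died_with = '';
{
    # A bare "FATAL" is handled as if "all" followed it, so any warning
    # raised in this scope becomes an exception.
    use warnings "FATAL";

    my $undefined;
    my $ok = eval { my $joined = "value: " . $undefined; 1 };
    $died_with = $@ unless $ok;
}
```

The concatenation with an undefined value would normally only warn; here
the eval fails and the message ends up in $died_with.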
| |
We're getting newlines in between items, and the easiest way to
deal with it is make them explicit so we expect what we're getting
and it's done the same everywhere.
| |
Previously the tests were run with the following config:
NonStop=0 TTY=db.out LineInfo=db.out
This meant that the debugger would write the prologue, command prompts
and their results and the epilogue to one handle, and any line trace
information to the second handle. Since those handles didn't share a
file position, the line trace info would overwrite the prologue, and
the epilogue would overwrite part of the line trace info.
When TTY=vt100 on Redhat systems this made the epilogue just long
enough to overwrite the line trace data that a test matched against,
causing the test to fail.
To fix this, I avoided setting LineInfo:
NonStop=0 TTY=db.out
and since LineInfo defaults to using the TTY handle, both types of
content are written to db.out *without* overwriting each other.
Unfortunately this broke some other tests, since the command prompts
which were overwritten by line trace information are now mixed in with
the line traces - I've modified the tests that failed to account for
the included command lines.
|
| |
|