Some warnings, such as deprecation warnings, are on by default:
    $ perl5.16.0 -e '$*'
    $* is no longer supported at -e line 1.
But turning *on* other warnings will turn them off:
    $ perl5.16.0 -e 'use warnings "void"; $*'
    Useless use of a variable in void context at -e line 1.
Either all warnings in any given scope are controlled by lexical
hints, or none of them are.
When a single warnings category is turned on or off, if the warnings
were controlled by $^W, then all warnings are first turned on
lexically if $^W is 1, and all warnings are turned off lexically
if $^W is 0.
That has the unfortunate effect of turning off warnings when it was
only requested that warnings be turned on.
These categories contain default warnings:
ambiguous
debugging
deprecated
inplace
internal
io
malloc
utf8
redefine
syntax
glob
overflow
precedence
prototype
threads
misc
Most also contain regular warnings, but these contain *only*
default warnings:
debugging
deprecated
glob
inplace
malloc
So we can treat $^W==0 as equivalent to qw(debugging deprecated glob
inplace malloc) when enabling lexical warnings.
While this means that some default warnings will still be turned off
by ‘use warnings "void"’, it won’t be as many as before. So at least
this is a step in the right direction.
(The real solution, of course, is to allow each warning to be turned
off or on on its own.)
|
This takes the output of regen/regcharclass.pl for all the 1- to 4-byte
UTF-8 representations of Unicode code points, and replaces the current
hand-rolled definition there. It does this only for ASCII platforms,
leaving EBCDIC to be machine-generated when run on such a platform.
I would rather have both versions be regenerated each time they are
needed, to avoid an EBCDIC dependency, but it takes more than 10 minutes
on my computer to process the 2 billion code points that have to be
checked for on ASCII platforms, and currently t/porting/regen.t runs
this program every time; that slowdown would be unacceptable. If
this is ever run under EBCDIC, the macro should be machine-computed
(very slowly). So, even though there is an EBCDIC dependency, it has
essentially been solved.
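For illustration, here is a minimal C sketch of the kind of chained
range tests such a generated macro performs. The function name is
invented, the ranges shown are the strict Unicode ones, and this sketch
(unlike the generated macro) does not reject surrogates; the real code
comes from regen/regcharclass.pl and differs in detail.
    #include <stddef.h>

    /* Sketch only: return the length (1..4) of a well-formed UTF-8
     * character starting at s and ending before e, or 0 if there is
     * none.  Surrogates (ED A0 80 .. ED BF BF) are not rejected here. */
    static size_t utf8_char_len(const unsigned char *s, const unsigned char *e)
    {
        ptrdiff_t avail = e - s;

        if (avail >= 1 && s[0] <= 0x7F)
            return 1;                                /* U+0000..U+007F   */

        if (avail >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF
            && (s[1] & 0xC0) == 0x80)
            return 2;                                /* U+0080..U+07FF   */

        if (avail >= 3 && s[0] >= 0xE0 && s[0] <= 0xEF
            && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80
            && !(s[0] == 0xE0 && s[1] < 0xA0))       /* reject overlongs */
            return 3;                                /* U+0800..U+FFFF   */

        if (avail >= 4 && s[0] >= 0xF0 && s[0] <= 0xF4
            && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80
            && (s[3] & 0xC0) == 0x80
            && !(s[0] == 0xF0 && s[1] < 0x90)        /* reject overlongs */
            && !(s[0] == 0xF4 && s[1] > 0x8F))       /* reject >U+10FFFF */
            return 4;                                /* U+10000..U+10FFFF */

        return 0;
    }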
|
This adds the capability to skip definitions that are for a platform
other than the desired one.
|
regen/regcharclass.pl has been enhanced in previous commits so that it
generates code as good as these hand-defined macro definitions for
various UTF-8 constructs. And it should be able to generate EBCDIC
ones as well. By using its definitions, we can remove the EBCDIC
dependencies for them. It is quite possible that the EBCDIC versions
were wrong, since they have never been tested. Even if
regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
those in one place than in all the sundry definitions.
|
On UTF-8 input known to be valid, continuation bytes must be in the
range 0x80 .. 0xBF. Therefore, any tests that a byte lies within those
bounds will always be true, and may be omitted.
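A small C sketch of the resulting simplification (the function name and
the 0x9F class boundary are invented for illustration):
    #include <stdbool.h>

    /* Test whether the continuation byte s[1] falls in a class whose
     * members happen to be the bytes 0x80..0x9F. */
    static bool cont_byte_in_class(const unsigned char *s)
    {
        /* On arbitrary input, both bounds would need checking:
         *     return s[1] >= 0x80 && s[1] <= 0x9F;
         * but on input known to be valid UTF-8, s[1] >= 0x80 always
         * holds for a continuation byte, so only the tighter upper
         * bound remains: */
        return s[1] <= 0x9F;
    }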
|
Indent a newly-formed block
|
A previous commit added an optimization to save a branch in the
generated code at the expense of an extra mask when the input class has
certain characteristics. This extends that to the case where
sub-portions of the class have similar characteristics. The first
optimization for the entire class is moved to right before the new loop
that checks each range in it.
|
This adds a test to a subroutine, which now returns 1 if the condition
will always match; the caller checks for that and omits the condition
from the generated macro.
|
Branches can be eliminated from the macros that are generated here
by using a mask, where applicable. This adds a check for whether this
optimization is possible, and applies it if so.
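The idea, sketched in C (a classic instance; the values here are
illustrative, not taken from the generated macros): two values that
differ in exactly one bit can be matched with a single masked compare
instead of two branches.
    #include <stdbool.h>

    /* Match 'A' (0x41) or 'a' (0x61), which differ only in bit 0x20. */
    static bool is_A_or_a(unsigned char c)
    {
        /* Two-branch version:  return c == 0x41 || c == 0x61;  */
        /* Branchless: force bit 0x20 on, then compare once.    */
        return (c | 0x20) == 0x61;
    }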
|
I find it confusing that the array element's name is the same as the
full array's.
|
This is to prepare for future commits which will act differently at the
deep level depending on some of the options.
|
The rules for whether an above-Latin1 code point matches are now saved
in a macro generated from a trie by regen/regcharclass.pl, and these are
now used by pp.c to test these cases. This allows removal of a wrapper
subroutine, and also removes the need to dynamically load the rules
into a swash at run time.
This macro is about as big as I'm comfortable compiling in, but it
saves building a hash that can grow over time, and removes a
subroutine and interpreter variables. Indeed, performance benchmarks
show that it is about the same speed as a hash, but it does not require
loading the rules in from disk the first time it is used.
|
The new type 'high' is used only on above-Latin1 code points. It is
designed for code that already knows the tested code point is not
Latin1, and avoids unnecessary tests.
|
This makes sure that the modifiers specified in the input are known to
the program.
|
|
Lines whose first non-blank character is a '#' are now considered to be
comments, and ignored. This allows some lines that had been commented
out to be moved back to after the __DATA__, where they really belong.
|
A future commit will want to use the first surrogate code point's UTF-8
value. Add this to the generated macros, and give it a name, since
there is no official one. The program has to be modified to cope with
this.
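For reference, the first surrogate code point is U+D800, whose
(non-strict) UTF-8 encoding is the three bytes ED A0 80. A generated
definition might look something like this (the macro name is invented,
since, as noted, there is no official one):
    /* U+D800, the first surrogate code point, as (non-strict) UTF-8 */
    #define FIRST_SURROGATE_UTF8 "\xED\xA0\x80"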
|
A previous commit has caused macros to be generated that will match
Unicode code points of interest to the \X algorithm. This patch uses
them. This speeds up the processing of modern Korean by 15%.
Together with recent previous commits, the throughput of modern Korean
under \X has more than doubled, and is now comparable to other
languages (which have themselves increased by 35%).
|
\X is implemented in regexec.c as a complicated series of property
look-ups. It turns out that many of those are for just a few code
points, and so can be more efficiently implemented with a macro than a
swash. This generates those.
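A sketch of the general shape of such a macro: membership in a small,
fixed set of code points collapses into a few integer compares, with no
swash lookup at all. The name and the particular code points below are
invented for illustration.
    /* Hypothetical: is cp in a small property consulted by \X? */
    #define is_SMALL_GCB_PROP(cp)                              \
        ((cp) == 0x034F                                        \
         || ((cp) >= 0x1160 && (cp) <= 0x11FF)                 \
         || (cp) == 0x200D)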
|
Future commits will add Unicode properties for this to generate macros,
and some of them may be empty in some Unicode releases. This just
causes such a generated macro to evaluate to 0.
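In other words, an empty property degenerates to a constant-false
macro, along these lines (name invented):
    /* Property with no members in this Unicode release */
    #define is_EMPTY_PROPERTY(cp) 0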
|
The character '0' could be omitted from some generated macros due to
the code's testing the value of a hash entry (getting 0, which is
false) instead of whether the entry exists.
|
This will now automatically generate macros for non-ASCII platforms,
by mapping the Unicode input to native output.
Doing this will allow several EBCDIC dependencies in other code to be
removed, and fixes the bug that this previously had on non-ASCII
platforms.
|
Newer options to unpack remove the need for Encode, and run faster.
|
Instead of having to list all code points in a class, you can now use
\p{} or a range.
This changes some classes to use \p{}, so that any changes Unicode
makes to the definitions don't have to be made manually here as well.
|
The recently added utf8_strings.h has been expanded to include more than
just strings. I'm renaming it to avoid confusion.
|
This adds a new capability to this program: to input a Unicode code point and
create a macro that expands to the platform's native value for it.
This will allow removal of a bunch of EBCDIC dependencies in the core.
|
An input line without a command is considered to be a request for the
UTF-8 encoded string of the code point. This allows an explicit
'string' to be used.
|
This allows the generated .h to look better.
|
Future commits will have other headers #include the headers generated by
these programs. It is best to guard against the preprocessor trying to
process them twice.
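A conventional include guard does the job; a sketch (the guard name is
invented):
    #ifndef PERL_GENERATED_EXAMPLE_H   /* guard name invented */
    #define PERL_GENERATED_EXAMPLE_H

    /* ... generated definitions ... */

    #endif /* PERL_GENERATED_EXAMPLE_H */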
|
This adds regen/utf8_strings.pl, which takes Unicode characters and
generates utf8_strings.h, containing #defines for macros that translate
from the character's name to its UTF-8 encoding. This is needed in a
few places, where previously things were manually figured out and
hard-coded in. Doing this instead makes things easier, and removes
EBCDIC dependencies/bugs, as the file would simply be regen'd on an
EBCDIC platform.
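A sketch of the kind of definition such a header would contain (this
particular name and character were chosen for illustration):
    /* U+FB00 LATIN SMALL LIGATURE FF, as UTF-8 */
    #define LATIN_SMALL_LIGATURE_FF_UTF8 "\xEF\xAC\x80"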
|
Tricky folds have been removed from the code, so the removed #defines
are obsolete. I'm leaving this in so it can conveniently be referred
to in case we ever need it again.
|
Since 6ea72b3a1, rv2hv and padhv have had the ability to return
booleans in scalar context, instead of bucket stats, if flagged the
right way. sub { %hash || ... } is optimised to take advantage of this.
If the || is in unknown context at compile time, the %hash is flagged
as being maybe a true boolean. When flagged that way, it returns a
boolean if block_gimme() returns G_VOID.
If rv2hv and padhv can already do this, then we don’t need the
boolkeys op any more. We can just flag the rv2hv to return a boolean.
In all the cases where boolkeys was used, we know at compile time that
it is true boolean context, so we add a new flag for that.
|
Benchmarking showed some speed-up when the result of the previous
search in an inversion list is cached, thus potentially avoiding a
search in the next call. This adds a field to each inversion list which
caches its previous search result.
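A sketch of the idea in C (the structure and names are invented;
perl's real inversion lists live inside SVs): the cached index from the
previous search is tried first, and the binary search runs only on a
miss.
    #include <stddef.h>

    typedef unsigned long UV;   /* stand-in for perl's UV */

    /* An inversion list: starts[] holds the first code point of each
     * run; even-indexed runs are in the set, odd-indexed runs are not. */
    typedef struct {
        const UV *starts;
        size_t    len;      /* number of runs; > 0                 */
        size_t    cached;   /* previous hit, initialized to 0      */
    } InvList;

    /* Return the index of the run containing cp (assumes cp >= starts[0]). */
    static size_t invlist_search(InvList *il, UV cp)
    {
        size_t c  = il->cached;
        size_t lo = 0, hi = il->len;

        /* Fast path: same run as the previous call? */
        if (cp >= il->starts[c]
            && (c + 1 == il->len || cp < il->starts[c + 1]))
            return c;

        while (lo + 1 < hi) {          /* ordinary binary search */
            size_t mid = lo + (hi - lo) / 2;
            if (il->starts[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }
        il->cached = lo;
        return lo;
    }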
|
This is useful for debugging, especially with -DT.
|
This starts with the existing table that mktables generates that lists
all the characters in Unicode that occur in multi-character folds, and
aren't in the final positions of any such fold.
It generates data structures with this information to make it quickly
available to code that wants to use it. Future commits will use these
tables.
|
This change adds the ability to specify that an output inversion list is
to contain only those code points that are above Latin-1. Typically,
the Latin-1 ones will be accessed by some other means.
|
Octals are no longer checked via this mechanism.
|
A substitution forces its target to a string upon successful
substitution, even if the substitution did nothing:
    $ ./perl -Ilib -le '$a = *f; $a =~ s/f/f/; print ref \$a'
    SCALAR
Notice that $a is no longer a glob after s///.
But vstrings are different:
    $ ./perl -Ilib -le '$a = v102; $a =~ s/f/f/; print ref \$a'
    VSTRING
I fixed this in 5.16 (1e6bda93) for those cases where the vstring ends
up with a value that doesn’t correspond to the actual string:
    $ ./perl -Ilib -le '$a = v102; $a =~ s/f/o/; print ref \$a'
    SCALAR
It works through vstring set-magic, which does the check and removes
the magic if it doesn’t match.
I did it that way because I couldn’t think of any other way to fix
bug #29070, and I didn’t realise at the time that I hadn’t fixed
all the bugs.
By making SvTHINKFIRST true on a vstring, we force it through
sv_force_normal before any in-place string operations. We can also
make sv_force_normal handle vstrings as well. This fixes all the
lingering-vstring-magic bugs in just two lines, making the vstring
set-magic (which is also slow) redundant. It also allows the special
case in sv_setsv_flags to be removed.
Or at least that was what I had hoped.
It turns out that pp_subst twists and turns in tortuous ways, and
needs special treatment for things like this.
And the do_trans functions weren't checking SvTHINKFIRST when arguably
they should have.
I tweaked sv_2pv{utf8,byte} to avoid copying magic variables that do
not need copying.
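A simplified C sketch of the "think first" pattern this relies on (the
type, flag names, and functions here are invented stand-ins for perl's
SV flags, SvTHINKFIRST, and sv_force_normal): every in-place
modification funnels unusual scalars through one normalization point,
and vstrings now take that path too.
    /* Invented stand-in for an SV */
    typedef struct {
        unsigned flags;
        char    *pv;              /* the string body */
    } SVish;

    #define FLAG_READONLY 0x1u
    #define FLAG_COW      0x2u
    #define FLAG_VSTRING  0x4u    /* vstrings are now in the set too */
    #define THINKFIRST    (FLAG_READONLY | FLAG_COW | FLAG_VSTRING)

    /* Un-COW, drop vstring magic, reject readonly, etc. */
    static void force_normal(SVish *sv)
    {
        sv->flags &= ~(FLAG_COW | FLAG_VSTRING);
    }

    static void modify_in_place(SVish *sv)
    {
        if (sv->flags & THINKFIRST)   /* SvTHINKFIRST(sv) in perl */
            force_normal(sv);         /* sv_force_normal() in perl */
        /* ... now safe to write through sv->pv in place ... */
    }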
|
ck_chdir, added in 2006 (d4ac975e), duplicates ck_trunc, added in
1993 (79072805), except for a null-op check which is harmless when
applied to chdir.
|
This array is a bit map containing the POSIX and similar character
classes for the first 256 code points. Prior to this commit, many
character classes were represented by two bits: one for the characters
that are in the class over the full Latin-1 range, and one for just
the ASCII characters that are in it. The number of bits in use was
approaching the 32-bit limit available without playing games.
This commit takes advantage of a recent commit that adds a bit to the
table for all the ASCII characters, and of the fact that the ASCII
characters in a character class are a subset of the full Latin-1
range. So a character is an ASCII-range member of a given character
class iff both the full-range class bit and the ASCII bit are set.
A new internal macro is created to generate code to determine if a
character is an ASCII-range character with the given class. It's not
clear if the generated code is faster or slower than the full-range
version.
The result is that nearly half the bits are freed up, as the ones for
the ASCII range are now redundant.
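A sketch of the combined test in C (the table and bit names are
invented; the real table is generated by regen/): one bit per class
over all of Latin-1, one shared ASCII bit, and a single AND against
both.
    typedef unsigned char U8;

    extern const unsigned int charclass_table[256];  /* invented name */

    #define CC_ASCII (1U << 0)   /* set for the 128 ASCII characters  */
    #define CC_ALPHA (1U << 1)   /* set for Latin-1-range alphabetics */

    /* An ASCII-range alphabetic has both bits set; no separate
     * "alpha within ASCII" bit is needed any more. */
    #define isALPHA_ASCII(c) \
        ((charclass_table[(U8)(c)] & (CC_ALPHA | CC_ASCII)) \
                                  == (CC_ALPHA | CC_ASCII))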
|
This changes the #defines to be just the shift number, while doing
the shifting in the macro that the number is passed to. This will prove
useful in future commits.
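Sketched in C (names invented): the #define now carries only the shift
count, and the macro it is handed to does the shift.
    extern const unsigned int charclass_table[256];  /* invented name */

    /* Before: the #define was the bit value itself:
     *     #define CC_ALPHA 0x0002
     * After: just the shift number ...                              */
    #define CC_ALPHA_SHIFT 1

    /* ... with the shifting done in the macro it is passed to: */
    #define CC_TEST(c, shift) \
        (charclass_table[(unsigned char)(c)] & (1U << (shift)))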
|
This does not replace the isASCII macro definition, as I think the
current one is more efficient than what this provides. But future
commits will rely on all the named character classes (e.g.,
/[[:ascii:]]/) having a bit, and this is the only one missing.
|
If a comment contained a semi-colon, the regular expression's greedy
quantifier would treat the portion of the comment before it as part of
the data to be processed.
|
This will reduce the machine code size on Visual C Perl, by removing
C stack clean-up opcodes and possible jmp opcodes after croak() and
similar functions. Perl's existing __attribute__noreturn__ macro (and
therefore GCC's __attribute__((noreturn))) is fundamentally
incompatible with MS's implementation of noreturn functions. win32.h
already has _MSC_VER-aware code blocks, so adding more isn't a problem.
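For context, a sketch of the two spellings and the usual way to bridge
them (the NORETURN macro and function names here are generic
illustrations, not perl's actual macros):
    #ifdef _MSC_VER
    #  define NORETURN __declspec(noreturn)      /* MSVC spelling        */
    #else
    #  define NORETURN __attribute__((noreturn)) /* GCC and compatibles  */
    #endif

    /* The compiler may now omit stack clean-up and the jmp that would
     * otherwise follow a call to this function. */
    NORETURN void my_croak(const char *msg);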
|
This takes the pessimistic approach of skipping it for any first
argument that is not a plain non-magical PV, just in case there is a
'p' or 'P' in the stringified form.
Otherwise it scans the PV for 'p' or 'P' and skips the folding if
either is present.
Then it falls through to the usual op-filtering logic.
I nearly made ‘pack;’ crash, so I added a test to bproto.t.