| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
This regex needs neither a quantifier nor a capture group.
|
|
|
|
|
|
|
| |
This adds the capability to get input to this program from another
program, thus allowing essentially arbitrary input.
This will be used in future commits.
|
|
|
|
|
|
| |
Karl Williamson noticed that we don't always deal with common suffixes in
the most efficient way. This change reworks how we convert a trie to an
optree so that common suffixes are always grouped together.
|
| |
|
|
|
|
|
|
|
|
|
| |
It is not possible to know how to interpret the returned length
without accessing the UTF8 flag, which is not reliable until
the SV has been stringified, which requires get-magic. So length
magic has not made sense since utf8 support was added. I have
removed all uses of length magic from the core, so this is now
dead code.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
We don't currently have any easy way to test regen/regcharclass.pl.
Perl #115078 is related to a bug in the _cleanup() routine, which is
fixed in the next patch.
|
|
|
|
|
|
| |
fm magic uses want_vtbl_fm, which is #defined as want_vtbl_regexp.
The definition in regen/mg_vtable.pl does not affect anything except
the documentation. It was listed as using regdata, which was wrong.
|
|
|
|
|
|
|
|
|
|
|
| |
The easiest way to fix this was to move the special handling out of
the regexp engine. Now a flag is set on the split op itself for
this case. A real regexp is still created, as that is the most
convenient way to propagate locale settings, and it prevents the
need to rework pp_split to handle a null regexp.
This also means that custom regexp plugins no longer need to handle
split specially (which they all do currently).
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
I reindented the tree in perllexwarn because I was simply copying and
pasting the output from:
perl regen/warnings.pl tree
|
|
|
|
|
|
| |
This will be used for cloning a ‘my’ sub on scope entry.
I was going to use pp_padcv for this, but it would end up having a
top-level if/else.
|
|
|
|
|
| |
This will be used for introducing ‘my’ subs on scope entry, by turning
off the stale flag.
|
|
|
|
|
|
|
|
| |
This will be used for storing the prototype CV of a ‘my’ sub. The
clone needs to occupy the pad entry so that padcv ops will be able to
find it. That means the clone has to displace its prototype. In case
the same sub is called recursively, we still need to be able to access
the prototype.
|
| |
| |
Some warnings, such as deprecation warnings, are on by default:
$ perl5.16.0 -e '$*'
$* is no longer supported at -e line 1.
But turning *on* other warnings will turn them off:
$ perl5.16.0 -e 'use warnings "void"; $*'
Useless use of a variable in void context at -e line 1.
Either all warnings in any given scope are controlled by lexical
hints, or none of them are.
When a single warnings category is turned on or off, if the warnings
were controlled by $^W, then all warnings are first turned on
lexically if $^W is 1 and all warnings are turned off lexically
if $^W is 0.
That has the unfortunate effect of turning off warnings when it was
only requested that warnings be turned on.
These categories contain default warnings:
ambiguous
debugging
deprecated
inplace
internal
io
malloc
utf8
redefine
syntax
glob
overflow
precedence
prototype
threads
misc
Most also contain regular warnings, but these contain *only*
default warnings:
debugging
deprecated
glob
inplace
malloc
So we can treat $^W==0 as equivalent to qw(debugging deprecated glob
inplace malloc) when enabling lexical warnings.
While this means that some default warnings will still be turned off
by ‘use warnings "void"’, it won’t be as many as before. So at least
this is a step in the right direction.
(The real solution, of course, is to allow each warning to be turned
off or on on its own.)
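The rule described above can be sketched roughly as follows. This is an illustrative Python model, not the core's implementation: the function name `enable_category`, the `ALL_CATEGORIES` set (only a subset of real categories), and the use of plain sets in place of the warnings bitmask are all assumptions made for the example.

```python
# Categories that, per the message above, contain *only* default warnings.
DEFAULT_ONLY = {"debugging", "deprecated", "glob", "inplace", "malloc"}

# Hypothetical, incomplete list of categories, for illustration only.
ALL_CATEGORIES = DEFAULT_ONLY | {"void", "io", "syntax", "misc"}


def enable_category(category, caret_w):
    """Model turning one category on when warnings were controlled by $^W.

    caret_w true  -> start from every category.
    caret_w false -> start from the default-only categories, rather than
                     from an empty set as before this change.
    """
    base = set(ALL_CATEGORIES) if caret_w else set(DEFAULT_ONLY)
    return base | {category}


enabled = enable_category("void", caret_w=0)
print(sorted(enabled))
```

With `$^W` at 0, asking for "void" now leaves "deprecated" enabled in this model, which is the behaviour the `$*` example above is after.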
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This takes the output of regen/regcharclass.pl for all the 1-4 byte
UTF8-representations of Unicode code points, and replaces the current
hand-rolled definition there. It does this only for ASCII platforms,
leaving EBCDIC to be machine generated when run on such a platform.
I would rather have both versions regenerated each time it is
needed, to save an EBCDIC dependency, but it takes more than 10 minutes
on my computer to process the 2 billion code points that have to be
checked for on ASCII platforms, and currently t/porting/regen.t runs
this program every time; that slowdown would be unacceptable. If
this is ever run under EBCDIC, the macro should be machine computed
(very slowly). So, even though there is an EBCDIC dependency, it has
essentially been solved.
|
|
|
|
|
| |
This adds the capability to skip definitions that are for a platform
other than the desired one.
|
|
|
|
|
|
|
|
|
|
|
| |
regen/regcharclass.pl has been enhanced in previous commits so that it
generates as good code as these hand-defined macro definitions for
various UTF-8 constructs. And, it should be able to generate EBCDIC
ones as well. By using its definitions, we can remove the EBCDIC
dependencies for them. It is quite possible that the EBCDIC versions
were wrong, since they have never been tested. Even if
regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
those in one place than in all the sundry definitions.
|
|
|
|
|
|
| |
On UTF-8 input known to be valid, continuation bytes must be in the
range 0x80 .. 0xBF. Therefore, any tests for being within those bounds
will always be true, and may be omitted.
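As a quick illustration of why such a bounds test is redundant on valid input, the following Python sketch collects the non-leading bytes of each encoded character and checks that they all lie in 0x80 .. 0xBF. The helper name `trailing_bytes` is made up for this example and is not part of regcharclass.pl.

```python
def trailing_bytes(text):
    """Yield every byte after the first in each character's UTF-8 encoding."""
    for ch in text:
        encoded = ch.encode("utf-8")
        # encoded[0] is the start byte; the rest are continuation bytes.
        yield from encoded[1:]


# One character each of 1, 2, 3 and 4 bytes in UTF-8.
sample = "a\u00e9\u20ac\U0001F600"
in_range = all(0x80 <= b <= 0xBF for b in trailing_bytes(sample))
print(in_range)
```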
|
|
|
|
| |
Indent a newly-formed block
|
|
|
|
|
|
|
|
|
| |
A previous commit added an optimization to save a branch in the
generated code at the expense of an extra mask when the input class has
certain characteristics. This extends that to the case where
sub-portions of the class have similar characteristics. The first
optimization for the entire class is moved to right before the new loop
that checks each range in it.
|
|
|
|
|
|
| |
This adds a test and returns 1 from a subroutine if the condition will
always match; and in the caller it adds a check for that, and omits the
condition from the generated macro.
|
|
|
|
|
|
| |
Branches can be eliminated from the macros that are generated here
by using a mask in cases where applicable. This adds checking to see if
this optimization is possible, and applies it if so.
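A minimal sketch of the idea, assuming the simplest case of two byte values that differ in exactly one bit: the pair can then be matched with a single masked comparison instead of two comparisons joined by a branch. The function `single_bit_mask` is hypothetical and only models the check; the real program's conditions for applying the optimization may be broader.

```python
def single_bit_mask(a, b):
    """Return (mask, value) so that (c & mask) == value iff c is a or b.

    Returns None when a and b do not differ in exactly one bit, in which
    case this particular trick does not apply.
    """
    diff = a ^ b
    if diff == 0 or diff & (diff - 1):   # zero or more than one bit set
        return None
    mask = 0xFF & ~diff                  # ignore the one differing bit
    return mask, a & mask


# 'A' (0x41) and 'a' (0x61) differ only in bit 0x20.
mask, value = single_bit_mask(0x41, 0x61)
matches = {c for c in range(256) if (c & mask) == value}
print(hex(mask), hex(value), sorted(matches))
```

In generated C this would correspond to something like `(c & 0xDF) == 0x41` in place of `c == 0x41 || c == 0x61`.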
|
|
|
|
| |
I find it confusing that the array element name is the same as that of the full array.
|
|
|
|
|
| |
This is to prepare for future commits which will act differently at the
deep level depending on some of the options.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The rules for determining whether an above-Latin1 code point matches are now saved
in a macro generated from a trie by regen/regcharclass.pl, and these are
now used by pp.c to test these cases. This allows removal of a wrapper
subroutine, and also there is no need for dynamic loading at run-time
into a swash.
This macro is about as big as I'm comfortable compiling in, but it
saves the building of a hash that can grow over time, and removes a
subroutine and interpreter variables. Indeed, performance benchmarks
show that it is about the same speed as a hash, but it does not require
having to load the rules in from disk the first time it is used.
|
|
|
|
|
|
| |
The new type 'high' is used only on above-Latin1 code points. It is
designed for code that already knows the tested code point is not
Latin1, and avoids unnecessary tests.
|
| |
|
|
|
|
|
| |
This makes sure that the modifiers specified in the input are known to
the program.
|
|
|
|
|
|
| |
Lines whose first non-blank character is a '#' are now considered to be
comments, and ignored. This allows the moving of some lines that have
been commented out back to after the __DATA__ where they really belong.
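The comment rule can be sketched as a small filter. This is an illustrative Python version, with the hypothetical name `significant_lines`; the actual program is Perl and reads its input from after `__DATA__`.

```python
def significant_lines(lines):
    """Yield lines that are not comments.

    A line is a comment when its first non-blank character is '#'.
    A '#' later in the line does not make it a comment.
    """
    for line in lines:
        if line.lstrip().startswith("#"):
            continue
        yield line


data = ["# whole-line comment", "   # indented comment", "0x41 # trailing text"]
print(list(significant_lines(data)))
```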
|
|
|
|
|
|
|
| |
A future commit will want to use the first surrogate code point's UTF-8
value. Add this to the generated macros, and give it a name, since
there is no official one. The program has to be modified to cope with
this.
|
|
|
|
|
|
|
|
|
|
| |
A previous commit has caused macros to be generated that will match
Unicode code points of interest to the \X algorithm. This patch uses
them. This speeds up modern Korean processing by 15%.
Together with recent previous commits, the throughput of modern Korean
under \X has more than doubled, and is now comparable to other
languages (which have themselves increased by 35%).
|
|
|
|
|
|
|
| |
\X is implemented in regexec.c as a complicated series of property
look-ups. It turns out that many of those are for just a few code
points, and so can be more efficiently implemented with a macro than a
swash. This generates those.
|
|
|
|
|
|
| |
Future commits will add Unicode properties for this to generate macros,
and some of them may be empty in some Unicode releases. This just
causes such a generated macro to evaluate to 0.
|
|
|
|
|
|
| |
The character '0' could be omitted from some generated macros because
the code tested the value of a hash entry (getting 0, which is false)
instead of testing whether the entry exists.
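The same class of bug can be shown in Python, where looking up a key and testing the result for truth differs from testing membership. The table and function names below are invented for the illustration, and an integer 0 stands in for the false value involved in the Perl code.

```python
# Hypothetical lookup table in which one legitimate entry has a false value.
table = {"0": 0, "1": 1}


def buggy_has(tbl, key):
    """Tests the stored value, so an entry holding 0 looks absent."""
    return bool(tbl.get(key))


def fixed_has(tbl, key):
    """Tests for existence, independent of the stored value."""
    return key in tbl


print(buggy_has(table, "0"), fixed_has(table, "0"))
```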
|
|
|
|
|
|
|
|
|
| |
This will now automatically generate macros for non-ASCII platforms,
by mapping the Unicode input to native output.
Doing this will allow several cases of EBCDIC dependencies in other code
to be removed, and fixes the bug that this previously had with non-ASCII
platforms.
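To give a feel for what mapping Unicode input to native output means, here is a small Python sketch that converts a code point to its byte in one EBCDIC code page. The choice of Python's `cp037` codec and the name `to_native` are assumptions for the example; it covers single-byte characters only and does not model the multi-byte handling that regen/regcharclass.pl needs.

```python
def to_native(code_point, codec="cp037"):
    """Return the native byte for a code point in the given code page.

    Only meaningful for characters the code page encodes as one byte.
    """
    return chr(code_point).encode(codec)[0]


# 'A', 'a' and '0' sit at different positions in EBCDIC than in ASCII.
for cp in (0x41, 0x61, 0x30):
    print(hex(cp), "->", hex(to_native(cp)))
```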
|
|
|
|
| |
Newer options to unpack alleviate the need for Encode, and run faster.
|
|
|
|
|
|
|
|
| |
Instead of having to list all code points in a class, you can now use
\p{} or a range.
This changes some classes to use \p{}, so that any changes Unicode
makes to the definitions don't have to be made manually here as well.
|
|
|
|
|
| |
The recently added utf8_strings.h has been expanded to include more than
just strings. I'm renaming it to avoid confusion.
|