| Commit message | Author | Age | Files | Lines |
Like regen/mk_invlists.pl, if any of various Unicode-related files
change, we can't rely on the generated file remaining unchanged.
A previous commit changed how \X is implemented, and now we don't need
these anymore.
and check that checksum in t/porting/regen.t. This makes the tests
run faster.
It's actually very rare for code to be presented with malformed UTF-8,
so give the compiler a hint about the likely branches.
This allows for an efficient isUTF8_CHAR macro, which does its own
length checking, and uses the UTF8_INVARIANT macro for the first byte.
On EBCDIC systems this macro, which does a table lookup, is quite a bit
more efficient than the many branches that would otherwise be needed.
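As an illustration (a minimal sketch, not Perl's actual generated macro), a single-character UTF-8 check that does its own length checking can look like the following. The `skip_len` classification of the first byte is the part that a 256-entry table lookup makes cheap on EBCDIC; the names here are hypothetical, and a real validator must additionally reject the overlong and surrogate forms this sketch lets through:

```c
#include <assert.h>

/* Hypothetical sketch of an isUTF8_CHAR-style check: classify the first
 * byte (a real implementation can use a 256-entry table, which is what
 * makes this cheap on EBCDIC), then verify length and continuations. */
static int skip_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* invariant byte: a complete character */
    if (b < 0xC2) return 0;   /* continuation byte or overlong lead */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;                 /* 0xF5..0xFF cannot start a valid char */
}

/* Returns the character's length in bytes, or 0 if malformed/truncated */
static int is_utf8_char_sketch(const unsigned char *s, const unsigned char *e)
{
    int len = skip_len(*s);
    if (len == 0 || e - s < len)
        return 0;
    for (int i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)   /* each must be a continuation byte */
            return 0;
    return len;
}
```

With this shape the macro refuses to read past `e`, so callers need no separate bounds check.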
This adds a new macro generation option for inputs that are checked
elsewhere for buffer overflow, but otherwise need validity checks.
This causes the generated regcharclass.h to be valid on all
supported platforms
This is because a future commit will execute this code multiple times,
and the library file should only be read once.
This reverts commit c4c8e61502fd5289a080f20332c6e3f9f23ce6e2.
It turns out that this scheme to bootstrap regcharclass.h onto a machine
not running ASCII required too much manual labor to get things working.
A better solution is to cross compile on an ASCII machine for the
target. Commit 6ff677df5d6fe0f52ca0b6736f8b5a46ac402943 created the
infrastructure to do that, and this commit starts the process of
changing regen/regcharclass.pl to use that.
This is a small improvement when a consecutive group of U8 code points
begins at 0 or ends at 255. Those end points cannot be exceeded, so
there is no need to test that end of the range. In several places this
causes a mask operation not to be generated.
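A sketch of the idea (illustrative macro names, not generated output), assuming the input is an unsigned 8-bit value (a U8): a range ending at 255 needs only its lower-bound test, since `cp <= 0xFF` cannot fail, and a range starting at 0 needs only the upper-bound test:

```c
#include <assert.h>

/* Sketch: matching 0xA0..0xFF of a U8 needs one comparison,
 * not (0xA0 <= cp && cp <= 0xFF): the upper test cannot fail. */
#define is_HIGH_PART(cp) ( 0xA0 <= (cp) )

/* Likewise 0x00..0x1F needs only the upper-bound test. */
#define is_LOW_PART(cp)  ( (cp) <= 0x1F )
```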
This brings Perl regular expressions more into conformance with Unicode.
/x now accepts 5 additional characters as white space. Use of these
characters as literals under /x has been deprecated since 5.18, so now
we are free to change what they mean.
This commit eliminates the static function that processes the old
whitespace definition (and a generated macro that was used only for
this), using the already existing one for the new definition. It
refactors slightly the static function that skips comments to mesh
better with the needs of its callers, and calls it in one place where
before the code was essentially duplicated.
p5p discussion starting in
http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that
the (?[ ]) comments should be terminated the same way as regular /x
comments, and this was also done in this commit. No prior notice is
necessary as this is an experimental feature.
In doing an audit of regcomp.c, and experimenting using
Encode::_utf8_on(), I found this one instance of a regen/regcharclass.pl
macro that could read beyond the end of the string if given malformed
UTF-8. Hence we convert to use the 'safe' form. There are no other
uses of the non-safe version, so we don't need to generate them.
Having these unused macros around just clutters up the header file
The macros generated by these options are not needed in the core;
generating them just clutters up the header file, and some will actually
be forbidden by the next commit.
My thinking was muddled when I made that commit, and this reverts the
essence of it. The theory was that since we have already processed the
regex pattern, we don't need to check it for malformedness, hence we
don't need the "safe" form of certain macros that check for and avoid
running off the end of the buffer. It is true that we don't have to
worry about malformedness indicating that the buffer is bigger than it
really is, but these macros can match up to three well-formed
characters, so we do have to make sure that all three are in the buffer,
and that the input isn't just the first two at the buffer's end.
This was caught by running valgrind.
This simplifies the macros generated which make sure there are no read
errors. Prior to this commit, the code generated looked like
    (e - s) > 3
        ? see if things of at most length 4 match
        : (e - s) > 2
            ? see if things of at most length 3 match
            : (e - s) > 1
                ? see if things of at most length 2 match
                : (e - s) > 0
                    ? see if things of at most length 1 match
For things that are a single character, the ones greater than length 2
must be in UTF-8, and their needed length can be determined by UTF8SKIP,
so we can get rid of most of the (e-s) tests.
This doesn't change the macros which can match multiple characters; that
is harder to do.
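A sketch of the difference for a single-character macro, assuming a UTF8SKIP-style lookup on the first byte (names here are hypothetical, and `utf8_skip_sketch` assumes its argument is a valid lead byte):

```c
#include <assert.h>

/* Stand-in for UTF8SKIP: the first byte alone determines how many
 * bytes one well-formed character occupies. */
static int utf8_skip_sketch(unsigned char b)
{
    return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
}

/* One guard replaces the cascade of (e - s) > N tests above. */
static int fits_one_char(const unsigned char *s, const unsigned char *e)
{
    return (e - s) >= utf8_skip_sketch(*s);
}
```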
This adds a comment to the generated file that the macros are not to be
generally used.
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) Character classes such as \w and [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case
insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
regen/regcharclass.pl can create macros for use where we need to worry
about the possibility of malformed UTF-8, and for where we don't. In
the case of looking at regex patterns, the Perl core has complete
control over generating them, and hence isn't generally going to create
too short a buffer; if it does, it's a bug that will show up and get
fixed. This commit changes to generate and use the faster macros that
don't do bounds checking.
Prior to this patch, this was in regen/mk_invlists.pl, but future
commits will want it to also be used by the header generated by
regen/regcharclass.pl, so use a common source so the logic doesn't have
to be duplicated.
This commit changes the code generated by the macros so that they work
right out-of-the-box on non-ASCII platforms for non-UTF-8 inputs. THEY
ARE WRONG for UTF-8, but this is good enough to get perl bootstrapped
onto the target platform, and regcharclass.pl can be run there,
generating macros with correct UTF-8.
use locale;
fc("\N{LATIN CAPITAL LETTER SHARP S}")
eq 2 x fc("\N{LATIN SMALL LETTER LONG S}")
should return true, as the SHARP S folds to two 's's in a row, and the
LONG S is an antique variant of 's', and folds to s. Until this commit,
the expression was false.
Similarly, the following should match, but didn't until this commit:
"\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa
The reason these didn't work properly is that in both cases the actual
fold to 's' is disallowed. In the first case because of locale; and in
the second because of /aa. And the code wasn't smart enough to realize
that these were legal.
The fix is to special case these so that the fold of sharp s (both
capital and small) is two LONG S's under /aa; as is the fold of the
capital sharp s under locale. The latter is user-visible, and the
documentation of fc() now points that out. I believe this is such an
edge case that no mention of it need be done in perldelta.
This will be used to deprecate uses of non-ASCII Pattern White Space
This Unicode property will be used in future commits
I was re-reading some code and got confused. This table matches just
the first character of a sequence that may or may not contain others.
Some compilers can't handle unexpanded macros longer than something
like 8000 characters. So we split up long ones into sub macros to work
around the problem
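The workaround can be pictured like this (a toy sketch with made-up names; the real generated sub-macros are machine-produced and far longer):

```c
#include <assert.h>

/* One logical test split across sub-macros, so that no single #define
 * body approaches the ~8000-character limit some compilers impose. */
#define is_VERTWS_part0(cp) ( 0x0A <= (cp) && (cp) <= 0x0D )
#define is_VERTWS_part1(cp) ( (cp) == 0x85 || (cp) == 0x2028 || (cp) == 0x2029 )
#define is_VERTWS_split(cp) ( is_VERTWS_part0(cp) || is_VERTWS_part1(cp) )
```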
This will avoid loading a swash when an above Latin1 code point is
tested to see if it matches \s. The SPACE macro is quite small, and
unlikely to grow over time, as Unicode has mostly finished adding white
space equivalents to the Standard.
The CCC_TRY_U macro in regexec.c could not be used for this, and I just
expanded out what it would generate, modified to use the macro instead
of a swash.
This adds macros to regen/regcharclass.pl that are usable as part of the
is_XDIGIT_foo() macros in handy.h, so that no function call need be done
to handle above Latin1 input. These macros are quite small, and
unlikely to grow over time. The functions that implement these in
utf8.c are also changed to use the macros instead of generating a swash.
This should speed things up slightly, with less memory used over time as
the swash fills.
This adds macros to regen/regcharclass.pl that are usable as part of the
is_BLANK_foo() macros in handy.h, so that no function call need be done
to handle above Latin1 input. These macros are quite small, and
unlikely to grow over time, as Unicode has mostly finished adding white
space equivalents to the Standard. The functions that implement these
in utf8.c are also changed to use the macros instead of generating a
swash. This should speed things up slightly, with less memory used over
time as the swash fills.
These two macros match the same things as \v does in patterns. I'm
leaving them undocumented for now.
Sayeth Karl:
In the _cp macros, the final test can be simplified:
/*** GENERATED CODE ***/
#define is_VERTWS_cp(cp) \
( ( 0x0A <= cp && cp <= 0x0D ) || ( 0x0D < cp && \
( 0x85 == cp || ( 0x85 < cp && \
( 0x2028 == cp || ( 0x2028 < cp && \
0x2029 == cp ) ) ) ) ) )
That 0x2028 < cp can be omitted and it will still mean the same thing.
And So Be It.
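Dropping the redundant comparison (which is implied by the `0x2029 == cp` test that follows it) leaves an equivalent macro:

```c
#include <assert.h>

/* Simplified form: matches exactly the same code points as the
 * original, with one fewer comparison. */
#define is_VERTWS_cp(cp) \
    ( ( 0x0A <= cp && cp <= 0x0D ) || ( 0x0D < cp && \
      ( 0x85 == cp || ( 0x85 < cp && \
      ( 0x2028 == cp || 0x2029 == cp ) ) ) ) )
```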
This commit revamps the recently added function calculate_mask() to not
just work to give a single mask/compare value for its input and fail if
there are none, but to return a list of masks/compares when the set can
be split up into subsets that each can be represented by a mask/compare.
If this list taken as a whole yields fewer branches than what we get
otherwise, it is better code, and is used.
Said another way, what we had there before was all or nothing; this
works to improve things even if we can't do it all.
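A sketch of the idea with a set that has no single covering mask: the ASCII digits 0x30..0x39. Splitting into two subsets, each representable by one mask/compare pair, still beats a longer branch chain (the macro name is illustrative, not generated output):

```c
#include <assert.h>

/* 0x30..0x37 differ only in the low three bits; 0x38..0x39 differ only
 * in the low bit.  Two AND+compare tests cover all ten digits. */
#define is_DIGIT_sketch(cp) \
    ( ((cp) & ~0x07) == 0x30 || ((cp) & ~0x01) == 0x38 )
```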
This changes the macro isMULTI_CHAR_FOLD() (non-UTF-8 version) from just
generating ASCII-range code points to generating the full Latin1 range.
However there are no such non-ASCII values, so the macro expansion is
unchanged. By changing the name, it becomes clearer in future commits
that we aren't excluding things that we should be considering.
These will be used in future commits
Karl Williamson noticed that we don't always deal with common suffixes in
the most efficient way. This change reworks how we convert a trie to an
optree so that common suffixes are always grouped together.
We don't have any easy way to test regen/regcharclass.pl currently.
Perl #115078 is related to a bug in the _cleanup() routine, which is
fixed with the next patch.
regen/regcharclass.pl has been enhanced in previous commits so that it
generates code as good as these hand-defined macro definitions for
various UTF-8 constructs. And it should be able to generate EBCDIC
ones as well. By using its definitions, we can remove the EBCDIC
dependencies for them. It is quite possible that the EBCDIC versions
were wrong, since they have never been tested. Even if
regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
those in one place, than all the sundry definitions.
On UTF-8 input known to be valid, continuation bytes must be in the
range 0x80 .. 0xBF. Therefore, any tests for being within those bounds
will always be true, and may be omitted.
A previous commit added an optimization to save a branch in the
generated code at the expense of an extra mask when the input class has
certain characteristics. This extends that to the case where
sub-portions of the class have similar characteristics. The first
optimization for the entire class is moved to right before the new loop
that checks each range in it.
Branches can be eliminated from the macros that are generated here by
using a mask where applicable. This adds checking to see if this
optimization is possible, and applies it if so.
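The optimization in a nutshell (illustrative macro name): when the values in a range differ only in their low bits, one AND plus one compare replaces a two-ended range test:

```c
#include <assert.h>

/* 0x2028 (LINE SEPARATOR) and 0x2029 (PARAGRAPH SEPARATOR) differ only
 * in bit 0, so masking that bit off lets a single compare match both. */
#define is_LINE_OR_PARA_SEP(cp) ( ((cp) & ~0x01) == 0x2028 )
```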
The rules for matching above-Latin1 code points are now saved in a
macro generated from a trie by regen/regcharclass.pl, and these are now
used by pp.c to test these cases. This allows removal of a wrapper
subroutine, and also there is no need for dynamic loading at run-time
into a swash.
This macro is about as big as I'm comfortable compiling in, but it
saves the building of a hash that can grow over time, and removes a
subroutine and interpreter variables. Indeed, performance benchmarks
show that it is about the same speed as a hash, but it does not require
having to load the rules in from disk the first time it is used.
\X is implemented in regexec.c as a complicated series of property
look-ups. It turns out that many of those are for just a few code
points, and so can be more efficiently implemented with a macro than a
swash. This generates those.
Instead of having to list all code points in a class, you can now use
\p{} or a range.
This changes some classes to use \p{}, so that any changes Unicode
makes to the definitions don't have to be made manually here as well.
|
Future commits will have other headers #include the headers generated by
these programs. It is best to prevent the preprocessor from trying to
process these twice.
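The guard is the standard preprocessor idiom; the guard macro name below is hypothetical, not necessarily what the regen scripts emit:

```c
#include <assert.h>

/* Standard include-guard idiom around the generated content. */
#ifndef PERL_REGCHARCLASS_H_
#define PERL_REGCHARCLASS_H_
#define is_VERTWS_cp_demo(cp) ((cp) == 0x0A)  /* stand-in for generated macros */
#endif /* PERL_REGCHARCLASS_H_ */

/* A second textual pass over the same guarded region is skipped, so
 * nothing is redefined: */
#ifndef PERL_REGCHARCLASS_H_
#error "never reached: the guard is already defined"
#endif
```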