| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
(cherry picked from commit 0cebf65582f924952bfee1472749d442d51e43e6)
|
|
|
|
|
|
| |
_is_utf8__perl_idstart is not an API function, so the short
_is_utf8__perl_idstart form cannot be used in public macros.
The long form (Perl__is_utf8__perl_idstart) must be used.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
On HP-UX 10.20 in the HP C-ANSI-C environment
CAT2(macro, _A)
expands to
macro01
as _A obviously expands to 01. This fix "breaks" the token
|
|
|
|
|
|
|
|
|
|
| |
It's much more likely that a random character will have its ordinal be
above the ordinal for '7' than below. In the test for if a character is
octal then, testing first if it is <= '7' will exclude many more
possibilities than if the first test is if it is >= '0'.
I left the ones for lowercase letters in the same order, because, in
ASCII, anyway, there are more characters below 'a' than above it.
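The ordering argument above can be illustrated with a small sketch. The macro and function names here are hypothetical, not the actual handy.h definitions, and the counts assume an ASCII platform and inputs spread evenly over the 256 byte values.

```c
#include <assert.h>

/* Hypothetical names; not the real handy.h macros. */

/* Upper bound first: most byte values are above '7', so the first
 * comparison already rejects them and the second is never evaluated. */
#define IS_OCTAL_UPPER_FIRST(c) ((c) <= '7' && (c) >= '0')

/* Lower bound first: same result, but most values pass the first test
 * and so need the second comparison as well. */
#define IS_OCTAL_LOWER_FIRST(c) ((c) >= '0' && (c) <= '7')

/* How many of the 256 byte values reach the second comparison
 * when the upper bound is tested first. */
static int second_tests_upper_first(void)
{
    int n = 0;
    for (int c = 0; c < 256; c++)
        if (c <= '7')
            n++;
    return n;
}

/* The same count when the lower bound is tested first. */
static int second_tests_lower_first(void)
{
    int n = 0;
    for (int c = 0; c < 256; c++)
        if (c >= '0')
            n++;
    return n;
}
```

Both macros give the same answer for every input; only the number of inputs that need a second comparison differs.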
|
| |
|
|
|
|
|
|
| |
This has the incorrect definition, allowing 8 and 9, for programs that
don't include perl.h. Likely no one actually uses this recently added
macro who doesn't also include perl.h.
|
| |
|
|
|
|
|
| |
Unoptimized, the new definition takes significantly fewer machine
instructions than the old one.
|
|
|
|
|
| |
This is clearer, and leads to better unoptimized code at least.
'bar' is a boolean
|
|
|
|
|
|
| |
This now takes advantage of the new table that mktables generates
to find out if a character is a legal start character in Perl's
definition. Previously, it had to be looked up in two tables.
|
| |
|
|
|
|
| |
This macro is in the pod, but never got defined.
|
|
|
|
|
|
| |
This patch avoids the overhead of calling e.g. is_utf8_alpha() on Latin1
inputs. The result is known to Perl's core, and this can avoid a swash
load.
|
|
|
|
|
|
| |
This patch avoids the overhead of calling e.g. is_utf8_alpha() on ASCII
inputs. The result is known to Perl's core, and this can avoid a swash
load.
|
|
|
|
|
| |
This patch avoids the overhead of calling e.g. is_uni_alpha() if the
result is known to Perl's core. This can avoid a swash load.
|
|
|
|
|
|
|
|
|
| |
Unicode stability policy guarantees that no code points will ever be
added to the control characters beyond those already in it.
All such characters are in the Latin1 range, and so the Perl core
already knows which ones those are, and so there is no need to go out to
disk and create a swash for these.
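Because the set of control characters is fixed and lies entirely in the Latin1 range, the test reduces to two range checks. A minimal sketch, assuming code points (not EBCDIC ordinals) and using a hypothetical macro name rather than the core's own definition:

```c
#include <assert.h>

/* Hypothetical name. The Unicode control characters (General_Category Cc)
 * are U+0000..U+001F and U+007F..U+009F. The casts make negative inputs
 * compare as very large values, so they are rejected. */
#define IS_CNTRL_CP_SKETCH(cp)                     \
    ((unsigned long)(cp) < 0x20                    \
     || ((unsigned long)(cp) >= 0x7F               \
         && (unsigned long)(cp) <= 0x9F))
```

No table on disk is needed for this; anything above U+009F is simply not a control character.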
|
|
|
|
|
|
|
| |
Only the characters whose ordinals are 0-127 are ASCII. This is
trivially computed by the macro, so no need to call is_uni_ascii() to do
this. Also, since ASCII characters are the same when represented in
utf8 or not, the utf8 function call is also superfluous.
|
|
|
|
|
|
|
|
|
|
|
| |
This retains essentially the same definition for EBCDIC platforms, but
substitutes a simpler one for ASCII platforms. On my system, the new
definition compiles to about half the assembly instructions that the old
one did (non-optimized).
A bomb-proof definition of ASCII is to make sure that the value is
unsigned in the largest possible unsigned for the platform so there is
no possible loss of information, and then the ord must be < 128.
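A minimal sketch of that "bomb-proof" test for ASCII platforms. The macro name is hypothetical, and uintmax_t is an assumption standing in for the platform's widest unsigned type:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name. Widening to the largest unsigned type before the
 * comparison means no high bits are discarded, and negative inputs
 * become very large values that fail the test. */
#define IS_ASCII_SKETCH(c) ((uintmax_t)(c) < 128)

/* For contrast: a narrowing cast loses information, so 0x141 would be
 * mistaken for 'A' (0x41). */
#define IS_ASCII_TRUNCATING(c) ((unsigned char)(c) < 128)
```

The truncating variant is shown only to make the "no possible loss of information" point concrete.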
|
|
|
|
|
|
| |
This creates a #define for the platform's widest UV, and then uses this
in the FITS_IN_8_BITS definition, instead of #ifdef'ing that. This will
be useful in future commits.
|
| |
|
|
|
|
|
|
|
|
|
| |
This means that the core uses the compiler's bool type if one exists.
This avoids potential problems of clashes between perl's own implementation
of bool and the compiler's bool type, which otherwise occur when one
attempts to include headers which in turn include <stdbool.h>.
Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously this used a home-grown definition of an identifier start,
stemming from a bug in some early Unicode versions. This led to some
problems, fixed by #74022.
But the home-grown solution did not track Unicode, and allowed for
characters, like marks, to begin words when they shouldn't. This change
brings this macro into compliance with Unicode going forward.
|
| |
|
|
|
|
|
| |
Now the contents of l1_char_class_tab.h are only the output of
Porting/mk_PL_charclass.pl
|
| |
|
| |
|
|
|
|
| |
Sorry for the huge config_h.SH re-order. Don't know (yet) what caused that
|
|
|
|
|
|
|
|
|
|
|
| |
Some ANYOF regnodes have the ANYOF_UNICODE_ALL flag set, which means
they match any non-Latin1 character. Under /i (in a utf8 target string),
these should also match any ASCII or Latin1 character that folds outside
the Latin1 range.
As part of this patch, an internal only macro is renamed to account for its
new use in regexec.c. The cumbersome name is to ward off others from
using it until the final semantics have been settled on.
|
|
|
|
|
|
|
|
|
| |
This creates a new macro for use by regcomp to test the new bit
regarding non-ascii folds.
Because the semantics may change in the future to deal with multi-char
folds, the name of the macro is unwieldy and specific enough that no one
should be tempted to use it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the definition of isIDFIRST_utf8 to avoid any characters
that would put the parser in a loop.
isIDFIRST_utf8 is used all over the place in toke.c. Almost every
instance is followed by a call to S_scan_word. S_scan_word is only
called when it is known that there is a word to scan.
What was happening was that isIDFIRST_utf8 would accept a character,
but S_scan_word in toke.c would then reject it, as it was using
is_utf8_alnum, resulting in an infinite number of zero-length
identifiers.
Another possible solution was to change S_scan_word to use
isIDFIRST_utf8 or similar, but that has back-compatibility problems,
as it stops q·foo· from being a string and makes it an identifier
instead.
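The failure mode described above can be sketched in isolation. Everything here is hypothetical (the predicates, the '~' character, and the scanner are stand-ins, not the toke.c code): a start test that accepts a character the word test rejects yields a zero-length word, and a caller that does not notice never advances.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical predicates: the start test accepts '~', the word test
 * does not, mirroring the mismatch described in the message. */
static int start_ok(int c) { return isalpha(c) || c == '_' || c == '~'; }
static int word_ok(int c)  { return isalnum(c) || c == '_'; }

/* Length of the word at s, as judged by word_ok alone. */
static size_t scan_word(const char *s)
{
    size_t n = 0;
    while (s[n] && word_ok((unsigned char)s[n]))
        n++;
    return n;
}

/* Returns the number of words found, or -1 if the scanner would stop
 * making progress (a zero-length word at an accepted start). */
static int count_words(const char *s)
{
    int words = 0;
    while (*s) {
        if (start_ok((unsigned char)*s)) {
            size_t len = scan_word(s);
            if (len == 0)
                return -1;  /* without this guard the loop never ends */
            s += len;
            words++;
        } else {
            s++;
        }
    }
    return words;
}
```

Making the start test a subset of what the word scanner accepts, as this commit does, removes the zero-length case without changing the scanner.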
|
|
|
|
|
| |
Anywhere an API function takes a string in pvn form, ensure that there
are corresponding pv, pvs, and sv APIs.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The recent series of commits on handy.h causes x2p to not compile.
These commits had some differences from what I submitted, in that they
moved the new table to a new header file instead of the submitted
perl.h. Unfortunately, this bypasses code in perl.h that figures
out about duplicate definitions, and externs, and so fails on programs
that include handy.h but not perl.h.
This patch changes things so that the table lookup is not used unless
perl.h is included. This is essentially my original patch, but adding
an #include of the new header file.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds *_L1() macros for character class lookup, using table
lookup for O(1) performance. These force a Latin-1 interpretation on
ASCII platforms.
There were a couple existing macros that had the suffix U for Unicode
semantics. I thought that those names might be confusing, so settled on
L1 as the least bad name. The older names are kept as synonyms for
backward compatibility. The problem with those names is that these are
actually macros, not functions, and hence can be called with any int,
including any Unicode code point. The U suffix might be mistaken for
indicating they are more general purpose, whereas they are really only
valid for the latin1 subset of Unicode (including the EBCDIC isomorphs).
When called with something outside the latin1 range, they will return
false.
This patch necessitated rearranging a few things in the file. I added
documentation for several more macros, and intend to document the rest.
(This commit was modified from its original form by Steffen.)
|
|
|
|
|
| |
This macro is clearer as to intent over isALNUM, and isn't confusable
with isALNUMC. So document it primarily.
|
| |
|
|
|
|
|
| |
There are a number of macros missing from the documentation. This helps
me figure out which ones.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch changes the macros whose names end in _A to use table lookup
except for the one (isASCII) which always has only one comparison.
The table is in l1_char_class_tab.h.
The advantage of this is speed. It replaces some fairly complicated
expressions with an O(1) look-up and a mask.
It uses the FITS_IN_8_BITS() macro to guarantee that the table bounds
are not exceeded. For legal inputs that are byte size, the optimizer
should get rid of this macro leaving only the lookup and mask.
(This commit was changed from its original form by Steffen.)
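A minimal sketch of the lookup-and-mask technique, for illustration only. The table, bit names, and macro names are hypothetical; the real table is generated into l1_char_class_tab.h rather than filled at run time, and uintmax_t stands in for the platform's widest unsigned type. ASCII is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical class bits, one per character class. */
enum { CC_DIGIT = 1 << 0, CC_ALPHA = 1 << 1, CC_SPACE = 1 << 2 };

/* One entry per byte value; each entry is a set of class bits. */
static uint8_t class_tab[256];

/* Fill in the ASCII entries. Safe to call more than once. */
static void init_class_tab(void)
{
    for (int c = '0'; c <= '9'; c++) class_tab[c] |= CC_DIGIT;
    for (int c = 'a'; c <= 'z'; c++) class_tab[c] |= CC_ALPHA;
    for (int c = 'A'; c <= 'Z'; c++) class_tab[c] |= CC_ALPHA;
    class_tab[' ']  |= CC_SPACE;
    class_tab['\t'] |= CC_SPACE;
    class_tab['\n'] |= CC_SPACE;
}

/* Bounds guard: true only if the value can index a 256-entry table. */
#define FITS_IN_8_BITS_SKETCH(c) ((uintmax_t)(c) < 256)

/* Guard, then one lookup and one mask. Note that c is evaluated twice. */
#define IS_CLASS_SKETCH(c, bit) \
    (FITS_IN_8_BITS_SKETCH(c) && (class_tab[(unsigned char)(c)] & (bit)))

#define IS_DIGIT_A_SKETCH(c) IS_CLASS_SKETCH(c, CC_DIGIT)
#define IS_ALPHA_A_SKETCH(c) IS_CLASS_SKETCH(c, CC_ALPHA)
#define IS_SPACE_A_SKETCH(c) IS_CLASS_SKETCH(c, CC_SPACE)
```

When the argument has a byte-sized type, the guard is always true, which is why an optimizer can be expected to drop it and leave only the lookup and mask.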
|
| |
|
|
|
|
| |
These macros return true only if the parameter is an ASCII character.
|
|
|
|
| |
as is better optimized and suitable for the purpose.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The name isALNUM() is problematic, as it is very close to isALNUMC(),
and doesn't mean exactly what most people might think. I presume the C
in isALNUMC stands for C language or libc, but am not sure. Others
don't know either. But in any event, isALNUM is different from the C
isalnum(), in that it matches the Perl concept of \w, which differs from
the C definition in exactly one place. Perl includes the underscore
character, '_'.
So, I'm adding an isWORDCHAR() macro for future code to use to be more
clear. I thought also about isWORD(), but I think confusion can arise
from thinking that means a whole word. isWORDCHAR_L1() matches in the
Latin1 range, to be equivalent to isALNUMU(). The motivation for using
L1 instead of U will be explained in a commit message for the other L1
macros that are to be added.
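The one-character difference from the C library can be shown with a short sketch. The macro name is hypothetical, it is not the handy.h definition, and it covers only the behavior of isalnum() in the default "C" locale:

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical name. Perl's \w in the ASCII range is what C's isalnum()
 * matches, plus the underscore. The cast keeps the argument to isalnum()
 * within the range the C standard requires. */
#define IS_WORDCHAR_SKETCH(c) \
    (isalnum((unsigned char)(c)) || (c) == '_')
```

A name like isWORDCHAR says "one character of a word", which is the reading the message argues isWORD would not give.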
|
| |
|
| |
|
|
|
|
|
| |
The only change here is that I sorted these #defines within their
groups, to make it much easier to follow what's going on.
|
|
|
|
| |
It didn't include the Latin1 space components.
|
|
|
|
| |
It doesn't include NBSP
|