| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously this used a home-grown definition of an identifier start,
stemming from a bug in some early Unicode versions. This led to some
problems, fixed by #74022.
But the home-grown solution did not track Unicode, and allowed
characters, such as marks, to begin words when they shouldn't. This change
brings this macro into compliance with Unicode going forward.
|
| |
|
|
|
|
|
| |
Now the contents of l1_char_class_tab.h are only the output of
Porting/mk_PL_charclass.pl
|
| |
|
| |
|
|
|
|
| |
Sorry for the huge config_h.SH re-order. Don't know (yet) what caused that
|
|
|
|
|
|
|
|
|
|
|
| |
Some ANYOF regnodes have the ANYOF_UNICODE_ALL flag set, which means
they match any non-Latin1 character. Under /i (in a utf8 target
string), these should also match any ASCII or Latin1 character that
folds outside the Latin1 range.
As part of this patch, an internal-only macro is renamed to account for its
new use in regexec.c. The cumbersome name is to ward off others from
using it until the final semantics have been settled on.
|
|
|
|
|
|
|
|
|
| |
This creates a new macro for use by regcomp to test the new bit
regarding non-ascii folds.
Because the semantics may change in the future to deal with multi-char
folds, the name of the macro is unwieldy and specific enough that no one
should be tempted to use it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the definition of isIDFIRST_utf8 to avoid any characters
that would put the parser in a loop.
isIDFIRST_utf8 is used all over the place in toke.c. Almost every
instance is followed by a call to S_scan_word. S_scan_word is only
called when it is known that there is a word to scan.
What was happening was that isIDFIRST_utf8 would accept a character,
but S_scan_word in toke.c would then reject it, as it was using
is_utf8_alnum, resulting in an infinite number of zero-length
identifiers.
Another possible solution was to change S_scan_word to use
isIDFIRST_utf8 or similar, but that has back-compatibility problems,
as it stops q·foo· from being a string and makes it an identifier
instead.
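The failure mode can be illustrated with a toy scanner. This is only a sketch: the predicates and scan_word() below are hypothetical ASCII-only stand-ins, not the toke.c code. It shows that when the start test accepts a character the continuation test rejects, the scanner consumes nothing, so a caller that loops on "is there a word here?" never advances.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical predicates, ASCII only. The "loose" start test wrongly
 * accepts '#', which the continuation test rejects. */
static int is_word_cont(int c)   { return isalnum(c) || c == '_'; }
static int is_start_loose(int c) { return isalpha(c) || c == '_' || c == '#'; }

/* A start test that can never disagree with the scanner: it also
 * requires the character to be a valid continuation character. */
static int is_start_consistent(int c)
{
    return (isalpha(c) || c == '_') && is_word_cont(c);
}

/* Stand-in for a word scanner: number of bytes forming a word at s. */
static size_t scan_word(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && is_word_cont((unsigned char)s[n]))
        n++;
    return n;
}

static void demo(void)
{
    /* The loose test says a word starts here, yet nothing is consumed. */
    assert(is_start_loose('#'));
    assert(scan_word("#foo") == 0);
    /* The consistent test refuses, so the caller never enters the loop. */
    assert(!is_start_consistent('#'));
}
```

Tightening the start test, rather than loosening the scanner, keeps existing quoting constructs parsing as before.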
|
|
|
|
|
| |
Anywhere an API function takes a string in pvn form, ensure that there
are corresponding pv, pvs, and sv APIs.
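As a rough sketch of what such a family looks like, the following uses a hypothetical function, count_a, that is not part of the Perl API. The pvn form takes a pointer and a length, the pv form computes the length with strlen(), and the pvs form accepts only a string literal and takes its length at compile time. The sv form is omitted because it needs the interpreter's SV type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* pvn form (hypothetical API): pointer plus explicit length. */
static size_t count_a_pvn(const char *pv, size_t len)
{
    size_t i, n = 0;
    for (i = 0; i < len; i++)
        if (pv[i] == 'a')
            n++;
    return n;
}

/* pv form: NUL-terminated string, length computed at run time. */
static size_t count_a_pv(const char *pv)
{
    return count_a_pvn(pv, strlen(pv));
}

/* pvs form: string literals only. The "" concatenation fails to compile
 * for anything that is not a literal; sizeof gives the length, so
 * embedded NULs are counted correctly. */
#define count_a_pvs(lit) count_a_pvn("" lit "", sizeof(lit) - 1)

static void demo(void)
{
    assert(count_a_pvn("banana", 3) == 1);  /* only "ban" is examined */
    assert(count_a_pv("banana") == 3);
    assert(count_a_pvs("a\0a") == 2);       /* length 3, NUL included */
}
```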
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The recent series of commits on handy.h causes x2p to not compile.
These commits had some differences from what I submitted, in that they
moved the new table to a new header file instead of the submitted
perl.h. Unfortunately, this bypasses the code in perl.h that sorts
out duplicate definitions and externs, and so fails on programs
that include handy.h but not perl.h.
This patch changes things so that the table lookup is not used unless
perl.h is included. This is essentially my original patch, but adding
an #include of the new header file.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds *_L1() macros for character class lookup, using table
lookup for O(1) performance. These force a Latin-1 interpretation on
ASCII platforms.
There were a couple of existing macros that had the suffix U for Unicode
semantics. I thought that those names might be confusing, so settled on
L1 as the least bad name. The older names are kept as synonyms for
backward compatibility. The problem with those names is that these are
actually macros, not functions, and hence can be called with any int,
including any Unicode code point. The U suffix might be mistaken for
indicating they are more general purpose, whereas they are really only
valid for the latin1 subset of Unicode (including the EBCDIC isomorphs).
When called with something outside the latin1 range, they will return
false.
This patch necessitated rearranging a few things in the file. I added
documentation for several more macros, and intend to document the rest.
(This commit was modified from its original form by Steffen.)
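The difference between the ASCII and Latin-1 interpretations can be sketched as follows. The names are hypothetical, an ASCII platform is assumed, and plain range checks stand in for the table lookup to keep the example short. The point is only the behaviour: the L1 form accepts Latin-1 letters, and returns false for anything above 0xFF.

```c
#include <assert.h>

/* Hypothetical stand-ins; range checks replace the real table here. */
static int my_isALPHA_A(unsigned long c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static int my_isALPHA_L1(unsigned long c)
{
    if (c > 0xFF)
        return 0;                       /* outside Latin-1: always false */
    if (my_isALPHA_A(c))
        return 1;
    /* Latin-1 letters: U+AA, U+B5, U+BA, U+C0..U+FF except the
     * multiplication (U+D7) and division (U+F7) signs. */
    return c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

static void demo(void)
{
    assert(!my_isALPHA_A(0xE9));   /* e-acute is not ASCII              */
    assert(my_isALPHA_L1(0xE9));   /* but it is a Latin-1 letter        */
    assert(!my_isALPHA_L1(0x100)); /* a letter, but above the L1 range  */
}
```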
|
|
|
|
|
| |
This macro is clearer in intent than isALNUM, and can't be confused
with isALNUMC. So document it primarily.
|
| |
|
|
|
|
|
| |
There are a number of macros missing from the documentation. This helps
me figure out which ones.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch changes the macros whose names end in _A to use table lookup
except for the one (isASCII) which always has only one comparison.
The table is in l1_char_class_tab.h.
The advantage of this is speed. It replaces some fairly complicated
expressions with an O(1) look-up and a mask.
It uses the FITS_IN_8_BITS() macro to guarantee that the table bounds
are not exceeded. For legal inputs that are byte size, the optimizer
should get rid of this macro leaving only the lookup and mask.
(This commit was changed from its original form by Steffen.)
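A minimal sketch of the lookup-and-mask idea, under several assumptions: the table, bit names and macros are hypothetical, the table is filled at run time from <ctype.h> in the default "C" locale rather than generated, and an ASCII platform is assumed. The guard shown is one plausible shape for a fits-in-8-bits test, not necessarily the handy.h definition.

```c
#include <assert.h>
#include <ctype.h>

/* True when c can safely index a 256-entry table. */
#define MY_FITS_IN_8_BITS(c) \
    ((sizeof(c) == 1) || (((unsigned long)(c) & 0xFF) == (unsigned long)(c)))

enum { MY_ALPHA = 1 << 0, MY_DIGIT = 1 << 1, MY_SPACE = 1 << 2 };

/* Zero-initialised: entries 128..255 carry no class bits, so every
 * ASCII-only test is false there. */
static unsigned char my_class_tab[256];

static void my_init_tab(void)
{
    int c;
    for (c = 0; c < 128; c++) {
        my_class_tab[c] = (unsigned char)(
            (isalpha(c) ? MY_ALPHA : 0) |
            (isdigit(c) ? MY_DIGIT : 0) |
            (isspace(c) ? MY_SPACE : 0));
    }
}

/* One bounds guard, one load, one mask. */
#define my_class_A(c, bit) \
    (MY_FITS_IN_8_BITS(c) && (my_class_tab[(unsigned char)(c)] & (bit)) != 0)
#define my_isALPHA_A(c) my_class_A(c, MY_ALPHA)
#define my_isDIGIT_A(c) my_class_A(c, MY_DIGIT)

static void demo(void)
{
    my_init_tab();
    assert(my_isALPHA_A('q'));
    assert(my_isDIGIT_A('7'));
    /* 0x141 has low byte 'A'; without the guard it would index 'A'. */
    assert(!my_isALPHA_A(0x141));
}
```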
|
| |
|
|
|
|
| |
These macros return true only if the parameter is an ASCII character.
|
|
|
|
| |
as is better optimized and suitable for the purpose.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The name isALNUM() is problematic, as it is very close to isALNUMC(),
and doesn't mean exactly what most people might think. I presume the C
in isALNUMC stands for C language or libc, but am not sure. Others
don't know either. But in any event, isALNUM is different from the C
isalnum(), in that it matches the Perl concept of \w, which differs from
the C definition in exactly one place. Perl includes the underscore
character, '_'.
So, I'm adding a isWORDCHAR() macro for future code to use to be more
clear. I thought also about isWORD(), but I think confusion can arise
from thinking that means a whole word. isWORDCHAR_L1() matches in the
Latin1 range, to be equivalent to isALNUMU(). The motivation for using
L1 instead of U will be explained in a commit message for the other L1
macros that are to be added.
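The one place where the two definitions differ can be shown directly. This is a sketch with a hypothetical macro name, restricted to ASCII and built on the C library's isalnum() in the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical ASCII-only word-character test: C's isalnum() plus '_'.
 * The range check runs first, so isalnum() never sees a value it is
 * not defined for. */
#define my_isWORDCHAR_A(c) \
    (((unsigned)(c) < 128) && (isalnum((int)(c)) || (c) == '_'))

static void demo(void)
{
    assert(!isalnum('_'));          /* C says: not alphanumeric  */
    assert(my_isWORDCHAR_A('_'));   /* \w says: a word character */
    assert(my_isWORDCHAR_A('x'));
    assert(!my_isWORDCHAR_A('-'));
}
```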
|
| |
|
| |
|
|
|
|
|
| |
The only change here is that I sorted these #defines within their
groups, to make it much easier to follow what's going on.
|
|
|
|
| |
It didn't include the Latin1 space components.
|
|
|
|
| |
It doesn't include NBSP
|
|
|
|
|
|
| |
The macro was using the ASCII definition, which includes neither NEL nor
NBSP. But, libc contains the correct definition, which is usable on
EBCDIC since we don't worry about locales there.
|
|
|
|
|
|
| |
Commit 4125141464884619e852c7b0986a51eba8fe1636 improperly got rid of
EBCDIC handling, as it combined the ASCII and EBCDIC versions, but left
the result in the ASCII-only branch. Just move to the common code.
|
| |
|
|
|
|
| |
This is a synonym for isALNUMU
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this patch, if isASCII() was called with something like 256,
it would return true.
For some reason unknown to me, U64 is defined only inside the perl core.
However, the equivalent U64TYPE is known everywhere, so in the macro
that can be called outside of core, use that instead.
The commit log doesn't give a reason for not defining U64 outside of
core, and no tests in the suite fail when it is defined outside core.
But out of caution, I'm just doing this workaround instead of exposing
U64.
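One way such a result can arise is a narrowing cast that discards the high bits before the comparison. The sketch below assumes that shape purely for illustration (it is not claimed to be the previous definition) and contrasts it with a comparison done on a 64-bit unsigned value. The macro names are hypothetical, and uint64_t stands in for U64TYPE.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed buggy shape: the cast to 8 bits turns 256 into 0. */
#define my_isASCII_narrow(c) ((uint8_t)(c) < 128)

/* Widening shape: the whole value takes part in the comparison, and
 * negative inputs become very large numbers. */
#define my_isASCII_wide(c)   ((uint64_t)(c) < 128)

static void demo(void)
{
    assert(my_isASCII_narrow(256));   /* wrong: 256 is not ASCII */
    assert(!my_isASCII_wide(256));
    assert(my_isASCII_wide('A'));
    assert(!my_isASCII_wide(-1));
}
```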
|
|
|
|
|
| |
EBCDIC platforms use isascii(), but it is not in all libcs, so it is
better to use our own.
|
|
|
|
|
| |
Previous documentation was wrong for EBCDIC platforms. This fixes that
and adds some more explanation.
|
|
|
|
|
|
| |
toUPPER() and toLOWER() were grouped with the character class functions
(in perlapi), to which they are related, but aren't the same. Create a
new heading for these.
|
|
|
|
|
|
|
|
|
|
| |
8 and 9 are not treated as alphas in parsing as opposed to illegal
octals.
This also adds tests to verify that 1-3 digits work in char classes.
I created an isOCTAL macro in case that lookup gets moved to a bit
field, as I plan to do later, for speed.
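A small sketch of an octal-digit test and a parser that takes one to three digits. The names are hypothetical and the macro is a plain range check on an ASCII platform, not the bit-field lookup mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range-check form: '8' and '9' are not octal digits. */
#define my_isOCTAL(c) ((c) >= '0' && (c) <= '7')

/* Parse up to three octal digits at s. Stores the number of digits
 * consumed in *used and returns the value. */
static unsigned parse_octal(const char *s, size_t *used)
{
    unsigned v = 0;
    size_t n = 0;
    while (n < 3 && my_isOCTAL(s[n])) {
        v = v * 8 + (unsigned)(s[n] - '0');
        n++;
    }
    *used = n;
    return v;
}

static void demo(void)
{
    size_t used;
    assert(parse_octal("18", &used) == 1 && used == 1);  /* stops at '8' */
    assert(parse_octal("777", &used) == 511 && used == 3);
    assert(parse_octal("8", &used) == 0 && used == 0);   /* no digits    */
}
```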
|
|
|
|
|
|
|
| |
This makes sure that the index into the arrays used to change between
lower and upper case will fit into their bounds; returning an error
character if not. The check is likely to be optimized out if the index
is stored in 8 bits.
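The bounds check can be sketched as below. Everything here is an assumption made for illustration: the table covers only ASCII case pairs, it is filled at run time, and MY_ERROR_CHAR (U+FFFD) is a hypothetical choice of error character.

```c
#include <assert.h>

#define MY_ERROR_CHAR 0xFFFDu   /* hypothetical "error character" */

/* Identity mapping except for 'A'..'Z'. */
static unsigned char my_lc_tab[256];

static void my_init_lc(void)
{
    int i;
    for (i = 0; i < 256; i++)
        my_lc_tab[i] = (unsigned char)
            ((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
}

/* Index the table only when the value is a valid index. For a byte-sized
 * argument the comparison is always true, so a compiler can drop it. */
#define my_toLOWER(c) \
    (((unsigned long)(c) < 256) \
        ? (unsigned)my_lc_tab[(unsigned char)(c)] \
        : MY_ERROR_CHAR)

static void demo(void)
{
    my_init_lc();
    assert(my_toLOWER('Q') == 'q');
    assert(my_toLOWER('!') == '!');
    assert(my_toLOWER(0x141) == MY_ERROR_CHAR);  /* would alias 'A' */
}
```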
|
|
|
|
|
|
|
| |
This macro is designed to be optimized out if the argument is
byte-length, but otherwise to be a bomb-proof way of making sure that
the argument occupies only 8 bits or fewer in whatever storage class it
is in.
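One plausible shape for such a macro is sketched here under a hypothetical name. The sizeof test is a compile-time constant, so for a byte-sized argument the whole expression folds to true. For wider types the value is masked and compared with itself, which also rejects negative numbers.

```c
#include <assert.h>

#define MY_FITS_IN_8_BITS(c) \
    ((sizeof(c) == 1) || (((unsigned long)(c) & 0xFF) == (unsigned long)(c)))

static void demo(void)
{
    unsigned char byte = 200;
    long wide = 300;

    assert(MY_FITS_IN_8_BITS(byte));   /* decided by sizeof alone */
    assert(MY_FITS_IN_8_BITS(255));
    assert(!MY_FITS_IN_8_BITS(256));
    assert(!MY_FITS_IN_8_BITS(wide));
    assert(!MY_FITS_IN_8_BITS(-1));    /* all high bits are set   */
}
```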
|
|
|
|
| |
New macro lex_stuff_pvs(), wrapping lex_stuff_pvn() for literal strings.
|
|
|
|
|
|
| |
If a bug is found in the handy.h macros, it may be necessary to fix the
duplicates in the cpan module. This may require filing a bug report
there.
|
|
|
|
| |
Refactor the macro append_flags() in dump.c to use it.
|
|
|
|
|
|
|
|
|
|
| |
The function perl_ebcdic_control() is unnecessary, as the toCTRL macro
that calls it can be changed to just map EBCDIC to ASCII first, and then
do the normal procedure.
This means that EBCDIC and ASCII will no longer diverge. Currently,
EBCDIC gives a syntax error for inputs outside its domain, whereas the
ASCII version accepts some of them.
|
|
|
|
|
|
|
|
| |
Prior to this patch, there was a potential bug in these two macros: if
they were called with a signed character outside the ASCII range, it
would be negative, and they always returned true for negative values.
Casting the parameter to an unsigned should fix that by having it be
interpreted as a number above the ASCII range.
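The effect of the cast can be sketched with a hypothetical ASCII test. The commit does not name the two macros here, so the definitions below are illustrative only.

```c
#include <assert.h>

/* Assumed buggy shape: a negative signed char passes the test. */
#define my_isASCII_buggy(c) ((c) <= 127)

/* With the cast, a negative value becomes a large unsigned number. */
#define my_isASCII_fixed(c) ((unsigned)(c) <= 127)

static void demo(void)
{
    signed char high = -23;   /* the byte 0xE9 seen through signed char */

    assert(my_isASCII_buggy(high));    /* wrong: -23 <= 127 */
    assert(!my_isASCII_fixed(high));
    assert(my_isASCII_fixed('A'));
}
```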
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ever since perl 4.000 we've only set the POSIX process name via
argv[0]. Unfortunately on Linux the POSIX name isn't used by utilities
like top(1), ps(1) and killall(1).
Now when we set C<$0 = "hello"> both C<qx[ps h $$]> (POSIX) and
C<qx[ps hc $$]> (legacy) will say "hello", instead of the latter being
"perl" as was previously the case.
See also the March 9 2010 thread "Why doesn't assignment to $0 on
Linux also call prctl()?" on perl5-porters.
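For reference, setting and reading back the legacy name on Linux can be sketched with prctl(2). The wrapper names are hypothetical and this is not the code in mg.c. On Linux the name is limited to 15 bytes plus the terminating NUL, and on other platforms the wrappers simply report failure.

```c
#include <assert.h>
#include <string.h>
#ifdef __linux__
#include <sys/prctl.h>
#endif

/* Set the kernel's short task name. Returns 0 on success, -1 otherwise. */
static int set_legacy_name(const char *name)
{
#ifdef __linux__
    return prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0);
#else
    (void)name;
    return -1;
#endif
}

/* Read the name back into a 16-byte buffer. Returns 0 on success. */
static int get_legacy_name(char buf[16])
{
#ifdef __linux__
    return prctl(PR_GET_NAME, (unsigned long)buf, 0, 0, 0);
#else
    buf[0] = '\0';
    return -1;
#endif
}

static void demo(void)
{
#ifdef __linux__
    char buf[16];
    assert(set_legacy_name("hello") == 0);
    assert(get_legacy_name(buf) == 0);
    assert(strcmp(buf, "hello") == 0);
#endif
}
```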
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
    bool b = (bool)some_int
doesn't necessarily do what you think. In some builds, bool is defined as
char, and that cast's behaviour is thus undefined. So this line in mg.c:
    const bool was_temp = (bool)SvTEMP(sv);
was actually setting was_temp to false even when the SVs_TEMP flag was set.
Fix this by replacing all the (bool) casts with a new cBOOL() cast macro
that (hopefully) does the right thing.
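The problem and the fix can be sketched as follows. The names are hypothetical: my_bool is a char-sized stand-in for bool (unsigned char is used so that the truncation in the demo is well defined), MY_FLAG_TEMP is an arbitrary flag above the low 8 bits, and my_cBOOL is one plausible shape for such a macro.

```c
#include <assert.h>

typedef unsigned char my_bool;   /* assumption: bool is char-sized      */
#define MY_FLAG_TEMP 0x0800u     /* hypothetical flag above bits 0..7   */

/* Test for non-zero first, then produce exactly 0 or 1. */
#define my_cBOOL(x) ((x) ? (my_bool)1 : (my_bool)0)

/* The plain cast keeps only the low 8 bits, losing the flag. */
static my_bool cast_flag(unsigned flags)
{
    return (my_bool)(flags & MY_FLAG_TEMP);
}

static my_bool cbool_flag(unsigned flags)
{
    return my_cBOOL(flags & MY_FLAG_TEMP);
}

static void demo(void)
{
    assert(cast_flag(MY_FLAG_TEMP) == 0);    /* flag set, result false */
    assert(cbool_flag(MY_FLAG_TEMP) == 1);
    assert(cbool_flag(0) == 0);
}
```

With a C99 _Bool the plain cast already converts any non-zero value to 1, so the macro matters only where bool is an ordinary integer type.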
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to now just about anything has been legal for a character name in
\N{...}. This means that legal code was broken by having \N{3,4} for
example mean [^\n]{3,4}. Such code doesn't come from standard
charnames, but from legal custom translators.
This patch deprecates "unreasonable" names. handy.h is changed by the
addition of macros that, taken together, define the names we deem
reasonable: an alphabetic first character, followed by alphanumerics
and some punctuation.
toke.c is changed to parse each name and to raise a warning if any
problematic characters are found.
Some tests and diagnostic documentation are also included.
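The shape of such a check can be sketched as below. The function names are hypothetical, only ASCII is handled, and the punctuation set (space, hyphen, parentheses, colon) is an assumption made for the example rather than the list the patch defines.

```c
#include <assert.h>
#include <ctype.h>

/* Assumed continuation set: alphanumerics plus a little punctuation. */
static int my_is_charname_cont(int c)
{
    return isalnum(c) || c == ' ' || c == '-'
        || c == '(' || c == ')' || c == ':';
}

/* A name is "reasonable" if it begins with a letter and every later
 * character is a continuation character. The empty name is rejected. */
static int name_is_reasonable(const char *s)
{
    if (!isalpha((unsigned char)*s))
        return 0;
    for (s++; *s != '\0'; s++)
        if (!my_is_charname_cont((unsigned char)*s))
            return 0;
    return 1;
}

static void demo(void)
{
    assert(name_is_reasonable("LATIN SMALL LETTER A"));
    assert(!name_is_reasonable("3,4"));   /* reads like a quantifier */
    assert(!name_is_reasonable(""));
}
```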
|