| Commit message | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
| |
UTS was a mainframe version of System V created by Amdahl, subsequently sold
to UTS Global. The port has not been touched since before 5.8.0, and UTS
Global is now defunct.
|
|
|
|
|
|
| |
- Use I64/UI64 suffixes rather than I64TYPE/U64TYPE casts for
INT64_C/UINT64_C, not just when _WIN64 is defined
- Use UI64 suffix rather than UL for U64_CONST
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Input text to be matched under /i is placed in EXACTFish nodes. The
current limit on such text is 255 bytes per node. Even if we raised
that limit, it would always be finite. If the input text is longer than
this, it is split across 2 or more nodes. A problem occurs when that
split occurs within a potential multi-character fold. For example, if
the final character that fits in a node is 'f', and the next character
is 'i', it should be matchable by LATIN SMALL LIGATURE FI, but because
Perl isn't structured to find multi-char folds that cross node
boundaries, we will miss it.
The solution presented here isn't optimal. What we do is try to prevent
all EXACTFish nodes from ending in a character that could be at the
beginning or middle of a multi-char fold. That prevents the problem.
But in actuality, the problem only occurs if the input text is actually
a multi-char fold, which happens much less frequently. For example,
we try to not end a full node with an 'f', but the problem doesn't
actually occur unless the adjacent following node begins with an 'i' (or
one of the other characters that 'f' can form a fold with). That is,
this patch splits when it doesn't need to.
At the point of execution for this patch, we only know that the final
character that fits in the node is that 'f'. The next character remains
unparsed, and could be in any number of forms, a literal 'i', or a hex,
octal, or named character constant, or it may need to be decoded (from
'use encoding'). So look-ahead is not really viable.
So finding if a real multi-character fold is involved would have to be
done later in the process, when we have full knowledge of the nodes, at
the places where join_exact() is now called, and would require inserting
a new node(s) in the middle of existing ones.
This solution seems reasonable instead.
It does not yet address named character constants (\N{}) which currently
bypass the code added here.
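
The splitting rule described above can be sketched roughly as follows.
This is only an illustration, not the code in regcomp.c: the function
names are hypothetical, and the fold table is a toy that covers just the
folds made entirely of ASCII letters ("ff", "fi", "fl", "ffi", "ffl",
"ss", "st"), whose non-final characters are 'f' and 's'.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the mktables-derived data: ASCII characters that can
 * sit in a non-final position of a multi-character fold.  The real list
 * is much larger and is not limited to ASCII. */
static int is_non_final_fold_char(char c)
{
    return c == 'f' || c == 'F' || c == 's' || c == 'S';
}

/* Return how many bytes of s (of length len) to place in the next node,
 * given that a node holds at most max_node bytes. */
size_t next_node_len(const char *s, size_t len, size_t max_node)
{
    size_t n;

    assert(max_node > 0);
    if (len <= max_node)
        return len;             /* everything fits; no split needed */

    n = max_node;
    /* Back off while the last character could begin or continue a fold,
     * but always keep at least one byte so that we make progress. */
    while (n > 1 && is_non_final_fold_char(s[n - 1]))
        n--;
    return n;
}

void demo_next_node_len(void)
{
    /* 'f' would be the last byte of a full node, so split before it. */
    assert(next_node_len("abcdefi", 7, 6) == 5);
    /* 'x' cannot start a fold, so the node is filled completely. */
    assert(next_node_len("abcdexi", 7, 6) == 6);
}
```

Note that the sketch, like the patch, backs off whenever the node would
end in 'f', without looking at what follows, so it sometimes splits
earlier than strictly necessary.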
|
|
|
|
|
|
|
|
|
|
| |
This starts with the existing table that mktables generates that lists
all the characters in Unicode that occur in multi-character folds, and
aren't in the final positions of any such fold.
It generates data structures with this information to make it quickly
available to code that wants to use it. Future commits will use these
tables.
|
|
|
|
|
|
|
|
| |
The 80286 was released two years before Perl 1, but the support code was
added with Perl 3. The chip hasn't been produced for more than 15 years -
even the 80386 hasn't been manufactured since 2007. Most of the other
memory model code was removed by commit 5869b1f143426909 in Sep 2000, so
support for 16-bit systems is long dead.
|
|
|
|
|
| |
This synchronizes the ANYOF_FOO usages to the isFOO() usages. Future
commits will take advantage of this relationship.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This array is a bit map containing the Posix and similar character
classes for the first 256 code points. Prior to this commit many
character classes were represented by two bits, one for characters that
are in it over the full Latin-1 range, and one for just the ASCII
characters that are in it. The number of bits in use was approaching
the 32-bit limit available without playing games.
This commit takes advantage of a recent commit that adds a bit to the
table for all the ASCII characters, and the fact that the ASCII
characters in a character class are a subset of the full Latin1
range. So a character is an ASCII-range character with the given
character class iff both the full-range character class bit and the
ASCII bit are set.
A new internal macro is created to generate code to determine if a
character is an ASCII range character with the given class. It's not
clear if the generated code is faster or slower than the full range
version.
The result is that nearly half the bits are freed up, as the ones for
the ASCII-range are now redundant.
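
A minimal sketch of the idea, assuming an ASCII platform. The bit
positions, macro names and the class_bits() helper are made up for
illustration; the real table is precomputed, whereas here the bits are
derived on the fly to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions, not the values used in handy.h. */
#define CC_ALPHA_BIT 0
#define CC_ASCII_BIT 1

/* Stand-in for one entry of the class bit map for code points 0..255. */
static uint32_t class_bits(uint8_t c)
{
    uint32_t bits = 0;
    int ascii_alpha  = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    int latin1_alpha = c == 0xAA || c == 0xB5 || c == 0xBA
                    || (c >= 0xC0 && c != 0xD7 && c != 0xF7);

    if (ascii_alpha || latin1_alpha)
        bits |= UINT32_C(1) << CC_ALPHA_BIT;
    if (c < 128)
        bits |= UINT32_C(1) << CC_ASCII_BIT;
    return bits;
}

/* Full Latin-1 range test: just the class bit. */
#define IS_CC(c, bit) ((class_bits((uint8_t)(c)) >> (bit)) & 1)

/* ASCII-range test: the class bit AND the shared ASCII bit must be set. */
#define CC_A_MASK(bit) ((UINT32_C(1) << (bit)) | (UINT32_C(1) << CC_ASCII_BIT))
#define IS_CC_A(c, bit) \
    ((class_bits((uint8_t)(c)) & CC_A_MASK(bit)) == CC_A_MASK(bit))

void demo_class_bits(void)
{
    assert(IS_CC(0xE9, CC_ALPHA_BIT));     /* e-acute is Latin-1 alpha */
    assert(!IS_CC_A(0xE9, CC_ALPHA_BIT));  /* ...but not ASCII alpha   */
    assert(IS_CC_A('a', CC_ALPHA_BIT));
}
```

With this layout each class needs one bit instead of two, at the cost of
comparing against a two-bit mask in the ASCII-range test.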
|
|
|
|
| |
This macro abstracts an operation, and will make future commits cleaner.
|
|
|
|
| |
This test is duplicated in the called macro.
|
|
|
|
| |
This moves a #define next to similar ones, and removes some white space
|
|
|
|
|
|
| |
This changes the #defines to be just the shift number, while doing
the shifting in the macro that the number is passed to. This will prove
useful in future commits
|
|
|
|
|
|
| |
These are renumbered so that the ones that correspond to character
classes in regcomp.h are related numerically as well. This will prove
useful in future commits.
|
|
|
|
|
| |
They are now ordered in the same order as the similar #defines in
regcomp.h. This will be useful in later commits
|
|
|
|
|
|
|
| |
This does not replace the isASCII macro definition, as I think the
current one is more efficient than the one this would provide. But
future commits will rely on all the named character classes (e.g.,
/[[:ascii:]]/) having a bit, and this is the only one missing.
|
|
|
|
|
| |
This creates a new, unpublished, macro to implement most of the other
macros. This macro will be useful in future commits.
|
|
|
|
| |
Tests to follow in a future commit.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
These macros have never worked outside the Latin1 range, so this extends
them to work.
There are no tests I could find for things in handy.h, except that many
of them are called all over the place during the normal course of
events. This commit adds a new file for such testing, containing for
now only a few tests for the isBLANK macros.
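
For illustration, a blank test that works on any code point might look
like the sketch below. The function name is hypothetical, and the list
is the set of horizontal-space characters as I understand current
Unicode (TAB plus the Space_Separator characters); U+180E was once in
that set but is not included here.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: returns non-zero if cp is a horizontal blank. */
int is_blank_cp(uint32_t cp)
{
    if (cp < 256)                       /* Latin-1 range */
        return cp == '\t' || cp == ' ' || cp == 0xA0;

    return cp == 0x1680                 /* OGHAM SPACE MARK */
        || (cp >= 0x2000 && cp <= 0x200A)
        || cp == 0x202F                 /* NARROW NO-BREAK SPACE */
        || cp == 0x205F                 /* MEDIUM MATHEMATICAL SPACE */
        || cp == 0x3000;                /* IDEOGRAPHIC SPACE */
}

void demo_is_blank_cp(void)
{
    assert(is_blank_cp(' '));
    assert(is_blank_cp(0x3000));
    assert(!is_blank_cp('\n'));         /* vertical space is not blank */
}
```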
|
|
|
|
|
| |
perlapi currently claims StructCopy takes two structs when it really
takes two pointers.
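
To make the calling convention concrete, here is a small sketch. The
STRUCT_COPY macro below is a local copy with the same argument order as
StructCopy(s, d, t) (source, destination, type); its body is my
understanding of the definition rather than a quote, so treat it as an
assumption.

```c
#include <assert.h>

/* s and d are POINTERS to structs of type t; *s is assigned to *d. */
#define STRUCT_COPY(s, d, t) (*((t *)(d)) = *((t *)(s)))

struct point { int x; int y; };

void demo_struct_copy(void)
{
    struct point src = { 3, 4 };
    struct point dst = { 0, 0 };

    STRUCT_COPY(&src, &dst, struct point);  /* addresses, not structs */
    assert(dst.x == 3 && dst.y == 4);
}
```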
|
| |
|
|
|
|
|
| |
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
|
|
|
|
|
|
|
| |
Commit c2da0b36ccf7393a329af732fac4153ddf6ab42e changed this macro, and
created a syntax error. But it turns out that there were no current
calls to it in the Perl core. When I tried adding one, it showed the
failure.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new definition is likely slightly faster, as it replaces an array
lookup with a mask.
Comments are also added, listing the other possible candidates for this
treatment, though the speed differential is unclear as they would also
add an extra test.
A U32 is used to store the information about the various properties for
a character. This frees up one bit of that for future other use.
|
|
|
|
|
| |
These functions should be used in preference to the old ones which can
read beyond the end of the input string.
|
|
|
|
|
| |
Making this an unsigned constant silences the scary and wrong Solaris
warnings about integer overflow
|
|
|
|
| |
This tests if a Latin1 character should be quoted.
|
|
|
|
|
|
| |
This takes advantage of the recently added Configure probe, and if the
platform has an isblank library function, calls that under locale. This
now matches the documentation.
|
|
|
|
|
|
|
| |
We have code that assumes that ASCII should be locale dependent, but it
was missing its final link. This supplies that, and makes the code work
as documented. I thought it better to do that than to document yet
another exception.
|
| |
|
|
|
|
| |
(cherry picked from commit 0cebf65582f924952bfee1472749d442d51e43e6)
|
|
|
|
|
|
| |
_is_utf8__perl_idstart is not an API function, so the short
_is_utf8__perl_idstart form cannot be used in public macros.
The long form (Perl__is_utf8__perl_idstart) must be used.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
On HP-UX 10.20 in the HP C-ANSI-C environment
CAT2(macro, _A)
expands to
macro01
as _A obviously expands to 01. This fix "breaks" the token
|
|
|
|
|
|
|
|
|
|
| |
It's much more likely that a random character will have its ordinal be
above the ordinal for '7' than below that for '0'. So, when testing
whether a character is octal, testing first if it is <= '7' will exclude
many more possibilities than testing first if it is >= '0'.
I left the ones for lowercase letters in the same order, because, in
ASCII, anyway, there are more characters below 'a' than above it.
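
The effect of the ordering can be checked with a small sketch (macro and
function names are hypothetical; ASCII is assumed). Both orderings give
the same answer; they differ only in how often the first comparison
alone settles the question.

```c
#include <assert.h>

#define IS_OCTAL_LOW_FIRST(c)  ((c) >= '0' && (c) <= '7')
#define IS_OCTAL_HIGH_FIRST(c) ((c) <= '7' && (c) >= '0')

/* Count the ASCII code points (0..127) that the first comparison alone
 * rejects, so that the second comparison is never evaluated. */
int rejected_by_first_test(int high_first)
{
    int c, n = 0;

    for (c = 0; c < 128; c++) {
        if (high_first ? !(c <= '7') : !(c >= '0'))
            n++;
    }
    return n;
}

void demo_octal_order(void)
{
    int c;

    /* The two orderings always agree on the result. */
    for (c = 0; c < 128; c++)
        assert(!IS_OCTAL_LOW_FIRST(c) == !IS_OCTAL_HIGH_FIRST(c));

    /* Testing <= '7' first short-circuits more often. */
    assert(rejected_by_first_test(1) > rejected_by_first_test(0));
}
```

In ASCII, 72 of the 128 code points lie above '7' while only 48 lie
below '0', which is where the advantage comes from.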
|
| |
|
|
|
|
|
|
| |
This has the incorrect definition, allowing 8 and 9, for programs that
don't include perl.h. Likely no one actually uses this recently added
macro who doesn't also include perl.h.
|
| |
|
|
|
|
|
| |
Unoptimized, the new definition takes significantly fewer machine
instructions than the old one.
|
|
|
|
|
| |
This is clearer, and leads to better unoptimized code at least.
'bar' is a boolean
|
|
|
|
|
|
| |
This now takes advantage of the new table that mktables generates
to find out if a character is a legal start character in Perl's
definition. Previously, it had to be looked up in two tables.
|
| |
|
|
|
|
| |
This macro is in the pod, but never got defined.
|
|
|
|
|
|
| |
This patch avoids the overhead of calling, e.g., is_utf8_alpha() on Latin1
inputs. The result is known to Perl's core, and this can avoid a swash
load.
|
|
|
|
|
|
| |
This patch avoids the overhead of calling, e.g., is_utf8_alpha() on ASCII
inputs. The result is known to Perl's core, and this can avoid a swash
load.
|
|
|
|
|
| |
This patch avoids the overhead of calling, e.g., is_uni_alpha() if the
result is known to Perl's core. This can avoid a swash load.
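
The pattern is a fast path in front of an expensive lookup. A rough
sketch, with hypothetical names throughout: slow_is_alpha() stands in
for the swash-backed function and only knows lowercase Greek, and a
counter records how often it is reached.

```c
#include <assert.h>
#include <stdint.h>

static int slow_lookups = 0;    /* how often the slow path was taken */

/* Stand-in for an expensive table lookup; knows only U+03B1..U+03C9. */
static int slow_is_alpha(uint32_t cp)
{
    slow_lookups++;
    return cp >= 0x3B1 && cp <= 0x3C9;
}

/* What the core already knows for code points below 256 (ASCII assumed). */
static int latin1_is_alpha(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
        || cp == 0xAA || cp == 0xB5 || cp == 0xBA
        || (cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7);
}

/* Only go to the slow lookup when the answer isn't already known. */
int is_alpha_cp(uint32_t cp)
{
    return cp < 256 ? latin1_is_alpha(cp) : slow_is_alpha(cp);
}

void demo_is_alpha_cp(void)
{
    int before = slow_lookups;

    assert(is_alpha_cp('a'));
    assert(slow_lookups == before);         /* fast path only */
    assert(is_alpha_cp(0x3B1));
    assert(slow_lookups == before + 1);     /* had to go the slow way */
}
```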
|
|
|
|
|
|
|
|
|
| |
Unicode stability policy guarantees that no code points will ever be
added to the control characters beyond those already in it.
All such characters are in the Latin1 range, and so the Perl core
already knows which ones those are, and so there is no need to go out to
disk and create a swash for these.
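
Since the set is closed and small, the whole test reduces to two range
checks. A sketch, with a hypothetical function name, covering the
control characters U+0000..U+001F and U+007F..U+009F:

```c
#include <assert.h>
#include <stdint.h>

/* Non-zero if cp is a control character; no table or disk access. */
int is_cntrl_cp(uint32_t cp)
{
    return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
}

void demo_is_cntrl_cp(void)
{
    assert(is_cntrl_cp(0x00));
    assert(is_cntrl_cp(0x7F));      /* DEL */
    assert(is_cntrl_cp(0x9F));
    assert(!is_cntrl_cp(0xA0));     /* NO-BREAK SPACE is not a control */
}
```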
|
|
|
|
|
|
|
| |
Only the characters whose ordinals are 0-127 are ASCII. This is
trivially computed by the macro, so no need to call is_uni_ascii() to do
this. Also, since ASCII characters are the same when represented in
utf8 or not, the utf8 function call is also superfluous.
|
|
|
|
|
|
|
|
|
|
|
| |
This retains essentially the same definition for EBCDIC platforms, but
substitutes a simpler one for ASCII platforms. On my system, the new
definition compiles to about half the assembly instructions that the old
one did (non-optimized)
A bomb-proof definition of ASCII is to make sure that the value is
unsigned in the largest possible unsigned for the platform so there is
no possible loss of information, and then the ord must be < 128.
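
A sketch of that definition for ASCII platforms, using uintmax_t as a
stand-in for the platform's widest unsigned type. The macro names are
hypothetical; the narrowing variant is included only to show what goes
wrong when high bits are thrown away before the comparison.

```c
#include <assert.h>
#include <stdint.h>

#define WIDEST_UNSIGNED uintmax_t

/* Widen first, then compare: negative values become huge, and no high
 * bits of a wide argument are lost. */
#define IS_ASCII(c) ((WIDEST_UNSIGNED)(c) < 128)

/* For contrast only: narrowing to 8 bits discards information. */
#define IS_ASCII_NARROWING(c) ((unsigned char)(c) < 128)

void demo_is_ascii(void)
{
    assert(IS_ASCII('A'));
    assert(!IS_ASCII(-1));                              /* negative input */
    assert(!IS_ASCII(UINT64_C(0x100000041)));           /* low byte is 'A' */
    assert(IS_ASCII_NARROWING(UINT64_C(0x100000041)));  /* wrongly accepted */
}
```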
|
|
|
|
|
|
| |
This creates a #define for the platform's widest UV, and then uses this
in the FITS_IN_8_BITS definition, instead of #ifdef'ing that. This will
be useful in future commits.
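
One way such a pair of definitions could look is sketched below. The
names carry a _SKETCH suffix because they are illustrative, uintmax_t is
assumed as the widest unsigned type, and the body is not claimed to
match the one in handy.h.

```c
#include <assert.h>
#include <stdint.h>

/* One #define for the widest unsigned type, used wherever needed. */
#define WIDEST_UTYPE_SKETCH uintmax_t

/* True if c is a one-byte object, or if no bits above the low 8 are
 * set after widening. */
#define FITS_IN_8_BITS_SKETCH(c) \
    (sizeof(c) == 1 || ((WIDEST_UTYPE_SKETCH)(c) >> 8) == 0)

void demo_fits_in_8_bits(void)
{
    char ch = 'z';

    assert(FITS_IN_8_BITS_SKETCH(ch));      /* a char always fits */
    assert(FITS_IN_8_BITS_SKETCH(255));
    assert(!FITS_IN_8_BITS_SKETCH(256));
    assert(!FITS_IN_8_BITS_SKETCH(-1));     /* widens to all-ones */
}
```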
|