| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
The code guarded by #ifndef U32_ALIGNMENT_REQUIRED attempts to optimize
byte-swapping by doing unaligned loads, but accessing data through
unaligned pointers is undefined behavior in C. Moreover, compilers are
more than capable of recognizing these open-coded byte-swap patterns and
emitting a bswap instruction, or an unaligned load instruction, or a
combined load, etc. There's no need for multiple paths to attain the
desired result.
See https://rt.perl.org/Ticket/Display.html?id=133495
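As an illustration of the kind of open-coded pattern the message refers to, here is a minimal sketch of alignment-independent 32-bit loads. The function names are hypothetical and are not taken from the Perl sources; the point is only that no unaligned pointer is dereferenced, and that compilers commonly reduce such shifts to a single load or byte-swap where the target permits.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value one byte at a
 * time. Only unsigned char accesses are made, so the behavior is
 * defined for any address, aligned or not. */
static uint32_t load_le32(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: the same idea for big-endian input. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         |  (uint32_t)p[3];
}
```

Either helper can be called on a pointer into the middle of a byte buffer, for example load_le32(buf + 1), without any alignment precondition.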
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It turns out that only a single number is needed to distinguish between
basic UTF-8 and UTF-EBCDIC. That number is how many bits of real
information are in each continuation byte. In UTF-8 it is 6 (2
bits reserved for syntax); in UTF-EBCDIC it is 5.
Everything else stems from reasonable decisions based on this
fundamental difference. So all the other constants can be common
between the two systems, using compile-time shifts and masks.
For Perl's extended UTF-8-like encoding, another constant is needed,
which is the number of continuation bytes appended when the start byte
is 8 bits. For both systems, that number is the minimum required to be
able to encode a 64-bit integer. (There are other ways to extend the
encoding, including some that are infinitely so. But Perl chose to just
append a fixed number of bytes, so it isn't extensible. But it has the
advantage of needing to rely only on the first byte to know how many
more are coming.)
This commit consolidates various constants that differed between the
two systems, but were unnecessarily so. There are other constants that
remain that differ between the two files; these are for convenience.
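The idea of deriving the other constants from the per-continuation-byte bit count can be sketched as follows. All names here are hypothetical, and the 0xB0 expression used to obtain the continuation marker is an assumption made for this illustration, not something stated in the message. The sketch covers only the basic encodings, not the extended start byte.

```c
#include <stdint.h>

/* 'bits' is the number of information bits per continuation byte:
 * 6 for UTF-8, 5 for UTF-EBCDIC. Everything below is computed from it. */

/* Mask selecting the payload bits of a continuation byte. */
static uint8_t cont_payload_mask(unsigned bits)
{
    return (uint8_t)((1u << bits) - 1u);
}

/* Mask selecting the fixed (syntax) bits of a continuation byte. */
static uint8_t cont_test_mask(unsigned bits)
{
    return (uint8_t)(0xFFu << bits);
}

/* Value of the fixed bits. 0xB0 is binary 1011 0000, so ANDing clears
 * the second-highest bit and leaves 10...... for 6 bits (0x80) and
 * 101..... for 5 bits (0xA0). This expression is an assumption. */
static uint8_t cont_mark(unsigned bits)
{
    return (uint8_t)(cont_test_mask(bits) & 0xB0u);
}

static int is_continuation(uint8_t byte, unsigned bits)
{
    return (byte & cont_test_mask(bits)) == cont_mark(bits);
}

/* Largest code point representable in an nbytes-long sequence, for
 * nbytes from 2 to 6: the start byte carries (7 - nbytes) payload bits
 * and each continuation byte carries 'bits' more. */
static uint32_t max_cp_for_len(unsigned nbytes, unsigned bits)
{
    unsigned total = (7u - nbytes) + (nbytes - 1u) * bits;
    return (uint32_t)((UINT64_C(1) << total) - 1u);
}
```

With bits set to 6 this gives the familiar UTF-8 limits of 0x7FF and 0xFFFF for two and three bytes; with bits set to 5 the two-byte limit drops to 0x3FF.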
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
| |
| |
| |
| | |
This can be derived from other values, removing an EBCDIC dependency
|
|/
|
|
|
| |
This variable can be defined from the same base in both UTF-8 and
UTF-EBCDIC, and doing so eliminates an EBCDIC dependency.
|
| |
|
|
|
|
| |
The called macro does the cast already
|
|
|
|
|
| |
By doing an '| 0' with a parameter in a macro expansion, a C syntax
error will be generated. This is free protection.
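A minimal sketch of the trick, assuming the intent is to reject arguments that are not of integer type (the message does not spell out which arguments trigger the error). The macro name is hypothetical.

```c
/* Hypothetical macro: '| 0' leaves an integer argument unchanged, but
 * bitwise OR is not defined for pointer or floating-point operands, so
 * passing one of those fails to compile. */
#define NODE_LEN_CHECKED(n)  ((n) | 0)

static unsigned twice_len(unsigned len)
{
    return 2u * NODE_LEN_CHECKED(len);   /* fine: len is an integer */
}

/* const char *p = "abc";
 * NODE_LEN_CHECKED(p);     would be rejected: invalid operands to '|' */
```

The check costs nothing at run time, since OR with a constant zero is trivially optimized away.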
|
|
|
|
|
|
| |
CPAN was mostly skipped before because so many distros raised errors,
but that is no longer true, so just skip about 10 that have big
problems, and test the rest
|
|
|
|
| |
0e06870bf080a38cda51c06c6612359afc2334e1
|
| |
|
|
|
|
|
| |
Note that utf8_distance returns IV, while STR_LEN is an unsigned value
of varying sizes.
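The hazard with mixing a signed distance and an unsigned length is that a plain comparison converts the signed operand to unsigned first. The following sketch shows one way to compare them safely; the function name and the use of intmax_t and uintmax_t are stand-ins chosen for this example, not the types or code used in the commit.

```c
#include <stdint.h>

/* Hypothetical helper: is a signed distance less than an unsigned
 * length? A negative distance is always less; only a non-negative one
 * is safe to convert to the unsigned type before comparing. */
static int signed_lt_unsigned(intmax_t distance, uintmax_t len)
{
    return distance < 0 || (uintmax_t)distance < len;
}
```

Written naively as distance < len, a distance of -1 would be converted to a very large unsigned value and the comparison would come out false.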
|
|
|
|
|
|
|
| |
this was missed from the previous commit
Also, fix typo in regen/regcharclass.pl. It was still referring to itself
as Porting/regcharclass.pl
|
| |
|
|
|
|
| |
The file from Unicode needs to be translated to native
|
|
|
|
| |
This table wasn't being translated into native code points
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Transform previously deprecated cases into exceptions.
Update diagnostic; change D to F
remove now irrelevant code (TonyC)
For: RT 134138
|
| |
|
|
|
|
|
|
|
|
|
| |
Commit 059703b088f44d5665f67fba0b9d80cad89085fd removed more code than
was intended. This commit restores the missing functions.
This showed up in MSWin32 builds, I presume VMS as well.
Spotted by Tony Cook
|
|
|
|
|
|
| |
On DEBUGGING builds, the asserts in the expansion of this macro build up
literal strings that are too large for the Win32 compiler. Solve this by
storing to an intermediary.
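Why nested asserting macros produce large string literals, and how an intermediary avoids it, can be sketched like this. The macro and function names are hypothetical and the example is far smaller than the real case.

```c
#include <assert.h>

/* Hypothetical checked accessor. assert() stringifies its argument for
 * the failure message. When CHECKED() is nested, the inner expansion
 * (including its own string literal) becomes part of the text that the
 * outer assert stringifies, so the literal grows with each level. */
#define CHECKED(x)  (assert((x) >= 0), (x))

/* Nested form: one expression, one very long generated literal. */
static int sum_nested(int a)
{
    return CHECKED(CHECKED(CHECKED(a) + 1) + 1);
}

/* Same computation via an intermediary: each assert only ever sees a
 * short expression, so every generated literal stays small. */
static int sum_stepwise(int a)
{
    int t = CHECKED(a) + 1;
    t = CHECKED(t) + 1;
    return CHECKED(t);
}
```

Both functions return the same value; only the size of the preprocessed source differs.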
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It has been deprecated since 5.26 to use various macros that deal with
UTF-8 inputs but don't have a parameter indicating the maximum length
beyond which we should not look. This commit changes all such macros,
as threatened in existing documentation and warning messages, to have an
extra parameter giving the length.
This was originally scheduled to happen in 5.30, but was delayed because
it broke some CPAN modules, and there wasn't really a good way around
it. But now that Devel::PPPort 3.54 is out, ppport.h has new facilities
for getting modules making these changes to work with older Perl
releases.
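The reason for the extra length parameter is that a routine inspecting a multi-byte character must not read past the end of the buffer. Below is a minimal sketch of a bounded check for standard UTF-8. It is a hypothetical helper, not one of the Perl macros, and it deliberately skips overlong and surrogate checks to stay short.

```c
#include <stddef.h>

/* Hypothetical helper: return the length of the UTF-8 sequence that
 * starts at s, or 0 if it is malformed or would extend to or past e
 * (one past the last readable byte). */
static size_t utf8_seq_len_bounded(const unsigned char *s,
                                   const unsigned char *e)
{
    size_t need, i;

    if (s >= e)
        return 0;                        /* nothing to look at */

    if (*s < 0x80)               need = 1;
    else if ((*s & 0xE0) == 0xC0) need = 2;
    else if ((*s & 0xF0) == 0xE0) need = 3;
    else if ((*s & 0xF8) == 0xF0) need = 4;
    else
        return 0;                        /* not a valid start byte */

    if ((size_t)(e - s) < need)
        return 0;                        /* sequence is truncated */

    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;                    /* not a continuation byte */

    return need;
}
```

Without the e parameter, a start byte at the very end of a buffer would invite a read of the bytes that follow it in memory.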
|
|
|
|
|
| |
The next commit removes some macros that this uses. They have been
deprecated, and the uses here were to test those deprecations.
|
|
|
|
|
|
| |
This silences a warning that the pragma it surrounds is not valid on
C++. We don't need to know that, and it clutters the compilation
output.
|
|
|
|
|
| |
This is like LEXACT, but it is known that only strings encoded in UTF-8
will match it, so we don't even have to try if that condition isn't met.
|
|
|
|
|
|
| |
See the previous commit for info on these.
I am not changing trie code to recognize these at this time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a new regnode for strings that don't fit in a regular
one, and adds a structure for that regnode to use. Actually using them
is deferred to the next commit.
This new regnode structure is needed because the previous structure only
allows for an 8 bit length field, 255 max bytes. This commit puts the
length instead in a new field, the same place single-argument regnodes
put their argument. Hence this long string is an extra 32 bits of
overhead, but at no string length is this node ever bigger than the
combination of the smaller nodes it replaces.
I also considered simply combining the original 8 bit length field
(which is now unused) with the first byte of the string field to get a
16 bit length, and have the actual string be offset by 1. But I
rejected that because it would mean the string would usually not be
aligned, slowing down memory accesses.
This new LEXACT regnode can hold as much as 1024 regular EXACT ones,
using 4K fewer overhead bytes to do so. That means it can handle
strings containing 262000 bytes. The comments give ideas for expanding
that should it become necessary or desirable.
Besides the space advantage, any hardware acceleration in memcmp
can be done in much bigger chunks, and otherwise the memcmp inner loop
(often written in assembly) will run many more times in a row, and our
outer loop that calls it, correspondingly fewer.
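The overhead arithmetic can be sketched with two hypothetical layouts: one with an 8-bit length in the header, and one that moves the length into a separate 32-bit field. The struct and field names are illustrative only, and the offsets assume a typical ABI with no padding between these fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical short-string node: the length lives in an 8-bit header
 * field, so a node holds at most 255 bytes of string. */
struct short_str_node {
    uint8_t  str_len;
    uint8_t  type;
    uint16_t next_off;
    char     string[1];
};

/* Hypothetical long-string node: the 8-bit field is unused and the
 * length moves to a 32-bit field placed right after the header. */
struct long_str_node {
    uint8_t  unused;
    uint8_t  type;
    uint16_t next_off;
    uint32_t str_len;
    char     string[1];
};

/* Bytes of bookkeeping before the string data begins. */
#define SHORT_OVERHEAD  offsetof(struct short_str_node, string)
#define LONG_OVERHEAD   offsetof(struct long_str_node, string)

/* Overhead saved by one long node replacing n short nodes. */
static size_t overhead_saved(size_t n_short_nodes)
{
    return n_short_nodes * SHORT_OVERHEAD - LONG_OVERHEAD;
}
```

Under these assumptions the string in the long node starts at offset 8, keeping it 4-byte aligned, and replacing 1024 short nodes saves 4088 bytes of headers, which matches the roughly 4K figure above.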
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the detection mechanism to check just before writing to see
if it would be out of bounds, and if so, instead break out of the loop,
and go close out the node. Prior to this commit space for a worst-case
scenario was reserved, and we didn't start a new character if we were in
that danger zone. This left nodes less fully packed than they could
have been.
Thus this improves the packing of nodes, especially under /i, from the
previous mechanism. But more importantly, it sets things up so that we
can potentially increase the node size as we go along.
This also changes the handling of avoiding splitting a multi-character
fold across nodes under /i. For example, take the sequence 'ffi'. We
wouldn't want to end a node with 'ff', when the first character in the
next node is an 'i', as U+FB03 folds to that sequence, and the code that
does pattern matching can't currently match across node boundaries.
Previously we backed off filling the node until the final character
wasn't one that could potentially cause such a break. That is, we didn't
look at the next character and see if it was an 'i' (or some other
potential multi-char fold). Now we do look at that next
character(s), and only back off if this actually would split a real
multi-char fold.
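The look-ahead decision can be sketched with a toy table of ASCII sequences that are targets of multi-character folds. This is an illustration under simplifying assumptions (a tiny hard-coded table, bytes rather than characters, hypothetical names), not the logic actually used in the regex compiler.

```c
#include <stddef.h>
#include <string.h>

/* Toy table: a few sequences that single characters fold to, such as
 * U+FB03 folding to "ffi". The real set is much larger. */
static const char *const toy_folds[] = {
    "ff", "fi", "fl", "ffi", "ffl", "ss", "st"
};

/* Return 1 if ending a node just before index 'cut' of s[0..len) would
 * separate the characters of some fold sequence in the toy table, that
 * is, if an occurrence [start, start+flen) has start < cut < start+flen. */
static int cut_splits_fold(const char *s, size_t len, size_t cut)
{
    size_t f;

    for (f = 0; f < sizeof toy_folds / sizeof toy_folds[0]; f++) {
        size_t flen = strlen(toy_folds[f]);
        /* Earliest start whose occurrence still reaches past the cut. */
        size_t start = (cut >= flen) ? cut - flen + 1 : 0;

        for (; start < cut && start + flen <= len; start++) {
            if (memcmp(s + start, toy_folds[f], flen) == 0)
                return 1;
        }
    }
    return 0;
}
```

For the string "xffiy", cutting after "xff" splits "ffi", so the node should back off; cutting after "x" splits nothing, so no space needs to be given up there.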
|
| |
|
|
|
|
| |
This is no longer used
|
|
|
|
|
|
|
|
|
| |
One of the variables is misnamed: upper_fill indicates that the
node has to be left not completely filled. Comments will be added in a
later commit.
The other two are renamed in preparation for future changes to more
accurately describe their new purposes.
|
|
|
|
|
| |
Outdent a block that was doubly indented. Change some other white space
and fix grammar in a comment
|
|
|
|
|
| |
This is in preparation for a later commit in which the current
mechanism will no longer be a legal lhs
|
| |
|
| |
|
|
|
|
| |
Fix issues noticed by porting/podcheck.t
|
|
|
|
|
|
| |
The last release version at this date is 3.52
turn the clock backward to 3.54 for now
so porting/cmp_version.t passes.
|
|
|
|
|
|
|
| |
Fixes GH #61 aka RT 134101
(cherry picked from commit 935b7556e54d4bd3c18fdfef2f072b674afb7051)
Signed-off-by: Nicolas R <atoomic@cpan.org>
|
| |
|
|
|
|
|
|
|
| |
This is updated to the latest blead.
(cherry picked from commit e7398cda98d95e464aefd3b7ab8a052bdf19c896)
Signed-off-by: Nicolas R <atoomic@cpan.org>
|
|
|
|
|
|
|
| |
So, don't use our macro that indicates we do provide it.
(cherry picked from commit 36672207f64165e8e58251a2a4cb4569984dadcd)
Signed-off-by: Nicolas R <atoomic@cpan.org>
|
|
|
|
|
| |
(cherry picked from commit 59c0a72a7f36c9f3e2c0779f5affc420499252b8)
Signed-off-by: Nicolas R <atoomic@cpan.org>
|
|
|
|
|
|
|
| |
Before these existed, they should be no-ops
(cherry picked from commit fbde8074e56bf0da478eb424c4bc9329ee48210b)
Signed-off-by: Nicolas R <atoomic@cpan.org>
|
|
|
|
|
| |
(cherry picked from commit bfe660f9f9775fc1cbbf1c5fd7ed809b3e4dd369)
Signed-off-by: Nicolas R <atoomic@cpan.org>
|