| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
This macro starts from the right side and matches UTF-8 white space
characters.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the ability to generate a trie macro that starts at the right
end of a string and backs up one matching byte at a time until a full
character is matched; bailing immediately if a non-matching byte is
found.
Previously, the way to accomplish this was to call the function to hop
back (which looked at the string byte by byte backwards until it found a
non-continuation byte), and then look forwards for matching bytes.
This new way is more efficient, as only the necessary bytes are
examined.
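As a rough illustration of the right-to-left idea, here is a hand-written sketch that handles only four white space characters in plain UTF-8 (U+0020, U+00A0, U+2028, U+2029). The function name is hypothetical, and this is not the macro that regcharclass.pl generates. It only shows how each byte examined either extends a possible match or ends the search at once.

```c
#include <stddef.h>

/* Hypothetical helper, not the generated macro.  Returns the byte length of
 * the white space character that ends the buffer [start, end), or 0 if the
 * buffer does not end with one of the four characters handled here. */
static size_t match_space_backwards(const unsigned char *start,
                                    const unsigned char *end)
{
    if (end - start < 1)
        return 0;

    const unsigned char last = end[-1];

    if (last == 0x20)                   /* U+0020 SPACE is a single byte */
        return 1;

    if (last == 0xA0) {                 /* U+00A0 is C2 A0 */
        if (end - start >= 2 && end[-2] == 0xC2)
            return 2;
        return 0;                       /* bail: preceding byte is wrong */
    }

    if (last == 0xA8 || last == 0xA9) { /* U+2028, U+2029 are E2 80 A8|A9 */
        if (end - start >= 3 && end[-2] == 0x80 && end[-3] == 0xE2)
            return 3;
        return 0;
    }

    return 0;   /* the very first byte examined already rules out a match */
}
```

No byte to the left of the candidate character is ever read, which is the saving the message describes over hopping back first and then matching forwards.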
|
|
|
|
| |
This is now generated by regcharclass.pl
|
|
|
|
|
| |
This will be used in the next commit. It requires only the first two
bytes to determine if a UTF-8 or UTF-EBCDIC sequence is for a surrogate.
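For plain UTF-8, the two-byte test can be sketched as below. The surrogates U+D800..U+DFFF encode as ED A0 80 through ED BF BF, so the first two bytes are enough. The function name is hypothetical, the caller is assumed to have made sure two bytes are readable, and the UTF-EBCDIC byte values differ and are not shown.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the two bytes at s begin the UTF-8 encoding
 * of a surrogate.  Assumes s[0] and s[1] are both readable. */
static bool is_surrogate_utf8_prefix(const unsigned char *s)
{
    return s[0] == 0xED && s[1] >= 0xA0 && s[1] <= 0xBF;
}
```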
|
|
|
|
|
|
|
|
|
|
|
| |
A commit a couple of commits back improved the generated output of this
script, and this builds on that. The improvement was to try a transform
that could lead to fewer conditionals, as bytes were grouped into fewer
ranges.
But that introduced a useless transformation for the single-element
ranges that remain. This commit removes the transformation when it is
not needed.
|
|
|
|
| |
This is in preparation for a future commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UTF-8 has some desirable characteristics not shared by UTF-EBCDIC. One
example is all the continuation bytes are in a single range.
By transforming a UTF-EBCDIC byte into I8 (similar to UTF-8), we gain
those characteristics, and may be able to save a conditional or three.
This commit creates a 2nd pass over the bytes that are to be matched,
transforming them into I8. If that pass results in fewer conditionals
than the traditional, native, generated code, use that result instead.
This saves quite a bit in some of the generated code, enabling the
quotemeta macro to be represented in a single part; previously it had to
be split to avoid compiler macro size limits.
|
|
|
|
| |
A future commit will put a block around this; indent now.
|
|
|
|
| |
These will be used in a future commit.
|
|
|
|
|
|
|
| |
A future commit will pass this function data that shouldn't be
translated into a mnemonic, like 'f' for the letter f. The reason is
that that code will potentially be executed on a machine with a
different character set than what the mnemonic would be valid for.
|
|
|
|
| |
A future commit will use this differently than the current name implies.
|
|
|
|
|
| |
This moves a loop earlier in the execution path. This will be useful in
a later commit.
|
| |
|
| |
|
|
|
|
|
| |
We can short circuit some work by moving the test earlier. This does
not change the generated file.
|
| |
|
|
|
|
| |
This will make future commits read better.
|
|
|
|
|
|
|
|
|
| |
This commit makes is_HANGUL_ED_utf8_safe() return 0 unconditionally on
EBCDIC platforms. This means its callers don't have to care what
platform is running. Change the two callers to take advantage of this.
The commit also changes the description of the macro to be slightly more
accurate.
|
|
|
|
|
|
|
|
|
|
| |
This creates macros for the non-character code points so that, given the
length of the UTF-8 sequence, only those ones that have that length
match. This makes for more efficient processing, to be used in a future
commit.
The place where the length changes depends on the platform type, and
these macros will keep the code from having to worry about that.
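A minimal sketch of the per-length idea, assuming an ASCII platform and plain UTF-8. The function names are hypothetical and this is not the generated code. There, U+FDD0..U+FDEF and U+FFFE..U+FFFF are three bytes long, and the plane-final non-characters from U+1FFFE through U+10FFFF are four bytes long. On EBCDIC platforms the boundary falls elsewhere, which is the detail the macros are said to hide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers.  Each one tests only the non-characters that have
 * exactly the stated UTF-8 length, so a caller that already knows the
 * sequence length pays for just one of them. */
static bool is_nonchar_utf8_len3(const unsigned char *s)
{
    if (s[0] != 0xEF)
        return false;
    if (s[1] == 0xB7)                       /* U+FDD0..U+FDEF */
        return s[2] >= 0x90 && s[2] <= 0xAF;
    if (s[1] == 0xBF)                       /* U+FFFE, U+FFFF */
        return s[2] == 0xBE || s[2] == 0xBF;
    return false;
}

static bool is_nonchar_utf8_len4(const unsigned char *s)
{
    /* U+nFFFE and U+nFFFF for planes 1..16: the low four bits of the
     * second byte are all set, the third byte is BF, the last is BE or BF. */
    if (s[0] < 0xF0 || s[0] > 0xF4)
        return false;
    if (s[1] < 0x80 || s[1] > 0xBF || (s[1] & 0x0F) != 0x0F)
        return false;
    if (s[0] == 0xF0 && s[1] < 0x90)        /* would be overlong */
        return false;
    if (s[0] == 0xF4 && s[1] > 0x8F)        /* would exceed U+10FFFF */
        return false;
    return s[2] == 0xBF && (s[3] == 0xBE || s[3] == 0xBF);
}

/* Dispatch on the already-known sequence length. */
static bool is_nonchar_utf8_by_len(const unsigned char *s, size_t len)
{
    switch (len) {
    case 3:  return is_nonchar_utf8_len3(s);
    case 4:  return is_nonchar_utf8_len4(s);
    default: return false;
    }
}
```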
|
|
|
|
|
|
|
|
|
|
|
| |
They generate C files.
Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl.
I can't get it to easily ignore whitespace; `git diff --name-only`
does not respect the -w flag.
regen_perly.pl is left alone. That would require rebuilding
perly.* which is beyond a simple indentation change.
|
|
|
|
|
|
|
|
| |
The macros generated by this script may have to be split into sub-macros
to make the overall macro fit the maximum number of characters allowed
by the compiler for a macro definition. This commit adds a trailing
underscore to the names of such intermediate macros so as to mark them
as non-API for autodoc.
|
|
|
|
|
|
|
| |
Previously regcharclass.pl could tell if an input string was a
multi-character fold of some Unicode code point. This commit adds the
ability to return what that code point is. This capability will be used
in a later commit.
|
|
|
|
|
| |
This avoided checking for optimizations. Whatever its original use, it
doesn't do any good, and the optimizations are actually useful.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous commit split inRANGE up so that code that was known to have
valid inputs to it could use a component that didn't have all the
compile-time checks (often duplicates) that otherwise are made.
This commit changes to use that component. The reason the compile-time
checks are unnecessary here, is this is machine-generated code known to
meet the inRANGE input requirements.
All those compile-time checks added up to being too large for some
compilers to handle.
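To illustrate what an unchecked range component might look like, here is a sketch with a hypothetical macro name. It is not Perl's actual definition. It uses the common trick of one unsigned subtraction and one comparison, and relies on the caller guaranteeing that lo <= hi and that all values are non-negative, which machine-generated code can promise.

```c
#include <stdbool.h>

/* Hypothetical name.  No compile-time sanity checks: the caller must
 * guarantee lo <= hi.  If c < lo, the unsigned subtraction wraps to a huge
 * value, so a single comparison covers both ends of the range. */
#define IN_RANGE_UNCHECKED(c, lo, hi) \
    ((unsigned)(c) - (unsigned)(lo) <= (unsigned)(hi) - (unsigned)(lo))

/* Example use: UTF-8 continuation bytes are 0x80..0xBF. */
static bool is_continuation_byte(unsigned char c)
{
    return IN_RANGE_UNCHECKED(c, 0x80, 0xBF);
}
```

Because the macro body carries no assertions, each expansion stays small, which matters when a generated macro nests many such tests.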
|
|
|
|
|
| |
The regen script was improperly collapsing two-element ranges into two
separate elements, which caused extraneous code to be generated.
|
|
|
|
| |
This does some line wrapping, etc.
|
| |
|
|
|
|
|
| |
This changes the generated macros to use a printable character or
mnemonic instead of a hex value. This makes the macros easier to read.
|
|
|
|
|
|
| |
This commit changes a sub in this file to be passed a new parameter.
This is in preparation for the value to be used in the caller. No need
to derive it twice.
|
|
|
|
|
|
|
|
| |
This will allow more flexibility in future commits: instead of using a
static format, they can use one based on the input value.
The only non-whitespace change in this commit is the reordering of a
couple of tests; I'm not sure why that happened.
|
|
|
|
|
|
| |
These macros will be used in a future commit, and are for
three-character folds. regen/regcharclass*.pl are changed for this
purpose.
|
|
|
|
|
|
|
| |
This was missed in the previous commit.
Also, fix a typo in regen/regcharclass.pl: it was still referring to
itself as Porting/regcharclass.pl.
|
|
|
|
|
| |
This was done by changing regen/regcharclass.pl. This results in half
the conditionals being needed, and in some cases better error checking.
|
|
|
|
| |
This has been replaced by regen/unicode_constants.pl some releases ago.
|
|
|
|
|
|
|
|
| |
Committer: For porting tests: Update $VERSION in 4 files.
Run:
./perl -Ilib regen/mk_invlists.pl
./perl -Ilib regen/regcharclass.pl
|
|
|
|
|
| |
This replaces a complicated trie with a dfa. This should cut down the
number of conditionals encountered in parsing many code points.
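For readers unfamiliar with the approach, a DFA for UTF-8 can be sketched as follows. This is an illustration only. It accepts standard UTF-8 as defined by RFC 3629, which is stricter than what Perl accepts, and it uses a switch for readability where a production dfa would normally use lookup tables. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* States: S_START is also the accepting state.  S_CONTn means n more
 * continuation bytes are needed.  The remaining states restrict the second
 * byte after particular start bytes. */
enum state { S_START, S_CONT1, S_CONT2, S_CONT3,
             S_E0, S_ED, S_F0, S_F4, S_REJECT };

/* One transition: current state plus input byte gives the next state. */
static enum state step(enum state st, unsigned char b)
{
    const bool cont = (b >= 0x80 && b <= 0xBF);

    switch (st) {
    case S_START:
        if (b <= 0x7F)              return S_START;
        if (b >= 0xC2 && b <= 0xDF) return S_CONT1;
        if (b == 0xE0)              return S_E0;   /* excludes overlongs  */
        if (b == 0xED)              return S_ED;   /* excludes surrogates */
        if (b >= 0xE1 && b <= 0xEF) return S_CONT2;
        if (b == 0xF0)              return S_F0;   /* excludes overlongs  */
        if (b >= 0xF1 && b <= 0xF3) return S_CONT3;
        if (b == 0xF4)              return S_F4;   /* caps at U+10FFFF    */
        return S_REJECT;
    case S_CONT1: return cont ? S_START : S_REJECT;
    case S_CONT2: return cont ? S_CONT1 : S_REJECT;
    case S_CONT3: return cont ? S_CONT2 : S_REJECT;
    case S_E0:    return (b >= 0xA0 && b <= 0xBF) ? S_CONT1 : S_REJECT;
    case S_ED:    return (b >= 0x80 && b <= 0x9F) ? S_CONT1 : S_REJECT;
    case S_F0:    return (b >= 0x90 && b <= 0xBF) ? S_CONT2 : S_REJECT;
    case S_F4:    return (b >= 0x80 && b <= 0x8F) ? S_CONT2 : S_REJECT;
    default:      return S_REJECT;
    }
}

/* True if the whole buffer is well-formed standard UTF-8. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    enum state st = S_START;
    for (size_t i = 0; i < len && st != S_REJECT; i++)
        st = step(st, s[i]);
    return st == S_START;
}
```

With a table in place of the switch, each byte costs one lookup regardless of which code point is being parsed, which is where the saving in conditionals would come from.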
|
|
|
|
|
|
|
| |
It was a macro that used a trie. This changes to use the dfa
constructed in previous commits. I didn't bother with taking
measurements. A dfa should have fewer conditionals for many code
points.
|
|
|
|
|
|
|
| |
With this commit, if a sequence passes the dfa, the result can be
returned immediately. Previously some rare, potentially problematic
sequences could pass, and these then needed further checking, which had
to be done every time. So this speeds up the general case.
|
|
|
|
|
|
|
| |
It was a macro that used a trie. This changes to use the dfa
constructed in previous commits. I didn't bother with taking
measurements. A dfa should require fewer conditionals to be executed
for many code points.
|
|
|
|
|
|
|
|
|
|
| |
We changed to use symbols not likely to be used by non-Perl code that
could conflict, and which have trailing underbars, so they don't look
like a regular Perl #define.
See https://rt.perl.org/Ticket/Display.html?id=131110
There are many more header files which are not guarded.
|
|
|
|
|
| |
Note that this isn't normally executed during build, so it wasn't spotted
earlier.
|
|
|
|
|
|
|
|
|
|
| |
Switch from two-argument form. Filehandle cloning is still done with the two
argument form for backward compatibility.
Committer: Get all porting tests to pass. Increment some $VERSIONs.
Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
For: RT #130122
|
|
|
|
|
|
|
|
|
|
| |
This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0,
which manually added const qualifiers to some generated code in order to
avoid some compiler warnings. This changes the code generator to use
the same 'const' qualifier generally. The code changed by the other
commit had been hand-edited after being generated to add branch
prediction, which would be too hard to program in at this time, so the
const additions also had to be hand-edited in.
|
|
|
|
|
| |
require calls now require ./ to be prepended to the file since . is no
longer guaranteed to be in @INC.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged but not completely forbidden.
It's best for code that isn't intended to operate on arbitrary other
code text to use the original definition, but code that does things,
such as source code control, should change to use this definition if it
wants to be Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the macro isUTF8_CHAR to have the same number of code
points built-in for EBCDIC as ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.
Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand copied into utf8.h, and LIKELY's manually added, then
the generating code was commented out. Now this has been done with
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.
The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting the number of #ifdef's
down, and the comments about it are changed somewhat.
|
|
|
|
| |
They are not "characters"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These may be useful to various module writers. They certainly are
useful for Encode. This makes public API macros to determine if the
input UTF-8 represents (one macro for each category)
a) a surrogate code point
b) a non-character code point
c) a code point that is above Unicode's legal maximum.
The macros are machine generated. In making them public, I am now using
the string end location parameter to guard against running off the end
of the input. Previously this parameter was ignored, as their use in
the core could be tightly controlled so that we already knew that the
string was long enough when calling these macros. But this can't be
guaranteed in the public API. An optimizing compiler should be able to
remove redundant length checks.
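The role of the end parameter can be sketched with the "above Unicode" category. This is a hypothetical function, not the generated macro. It assumes an ASCII platform and looks only at the leading bytes, without validating the rest of the sequence. In the extended UTF-8 scheme, a start byte of F5 or higher can only begin a sequence above U+10FFFF, while F4 needs a second byte of 90 or more.

```c
#include <stdbool.h>

/* Hypothetical helper.  s points at the start of a sequence and e just past
 * the end of the buffer.  No byte at or beyond e is ever read. */
static bool starts_above_unicode_safe(const unsigned char *s,
                                      const unsigned char *e)
{
    if (s >= e)                     /* nothing to look at */
        return false;
    if (s[0] >= 0xF5)               /* one byte is enough to decide */
        return true;
    if (s[0] == 0xF4)               /* need a second byte, so check first */
        return (e - s) >= 2 && s[1] >= 0x90 && s[1] <= 0xBF;
    return false;
}
```

When a caller has already established that the buffer is long enough, a compiler can often prove the `(e - s) >= 2` test redundant and drop it.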
|
|
|
|
|
| |
This just changes, for properties that aren't defined in all Unicode
versions, to use synonyms that are defined in all
|