path: root/regen/regcharclass.pl
Commit message (Author, Age, Files, Lines)
* Add is_XPERLSPACE_utf8_safe_backwards() (Karl Williamson, 2022-03-19, 1 file, -0/+4)
| This macro starts from the right side and matches UTF-8 white space characters.
* regen/regcharclass.pl: Add backwards UTF-8 tries (Karl Williamson, 2022-03-19, 1 file, -21/+58)
| This adds the ability to generate a trie macro that starts at the right end of a string and backs up one matching byte at a time until a full character is matched, bailing out immediately if a non-matching byte is found. Previously, the way to accomplish this was to call the function to hop back (which looked at the string byte by byte backwards until it found a non-continuation byte), and then look forwards for matching bytes. This new way is more efficient, as only the necessary bytes are examined.
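The backwards matching described in this commit can be sketched by hand as follows. This is only an illustration, not the generated macro: the function name `match_space_backwards` is hypothetical, it assumes an ASCII (non-EBCDIC) platform, and it recognizes just three white-space characters (U+0020, U+00A0, U+2028). The point is that each step looks only at the next byte to the left and gives up as soon as a byte cannot belong to a match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Perl): returns the byte length of the
 * white-space character that ends just before 'e', or 0 if none of the three
 * characters handled here matches.  's' is the start of the buffer, so the
 * function never reads to the left of it. */
static size_t match_space_backwards(const unsigned char *s,
                                    const unsigned char *e)
{
    assert(s <= e);
    if (e - s < 1)
        return 0;
    if (e[-1] == 0x20)                  /* U+0020 SPACE: one byte */
        return 1;
    if (e[-1] == 0xA0)                  /* final byte of U+00A0 (C2 A0) */
        return (e - s >= 2 && e[-2] == 0xC2) ? 2 : 0;
    if (e[-1] == 0xA8)                  /* final byte of U+2028 (E2 80 A8) */
        return (e - s >= 3 && e[-2] == 0x80 && e[-3] == 0xE2) ? 3 : 0;
    return 0;                           /* non-matching byte: bail at once */
}
```

A caller scanning trailing white space could subtract the returned length from `e` and repeat until the function returns 0.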
* utf8.c: Rmv an EBCDIC dependency (Karl Williamson, 2021-08-14, 1 file, -0/+2)
| This is now generated by regcharclass.pl
* regcharclass.pl: Add fast surrogate UTF-8 trie (Karl Williamson, 2021-08-07, 1 file, -1/+1)
| This will be used in the next commit. It requires only the first two bytes to determine if a UTF-8 or UTF-EBCDIC sequence is for a surrogate.
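Why two bytes are enough can be seen from the encoding itself: on ASCII platforms the surrogates U+D800..U+DFFF encode as ED A0 80 through ED BF BF, so the first two bytes already settle the question. The sketch below assumes plain UTF-8 (not UTF-EBCDIC), and the function name is hypothetical rather than the macro this commit generates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch for ASCII platforms only: decides from the first two
 * bytes whether the sequence starting at 's' encodes a surrogate.  The third
 * byte is deliberately not examined. */
static int is_surrogate_utf8_sketch(const unsigned char *s,
                                    const unsigned char *e)
{
    assert(s <= e);
    return (e - s) >= 2
        && s[0] == 0xED
        && s[1] >= 0xA0 && s[1] <= 0xBF;
}
```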
* regcharclass.pl: Further improve EBCDIC code (Karl Williamson, 2021-08-07, 1 file, -9/+25)
| A commit a couple of commits ago improved the generated output of this script. This builds on that. The improvement was to try a transform that could lead to fewer conditionals, as bytes were grouped in fewer ranges. But that introduced a useless transformation for the single-element ranges that remain. This commit removes the transformation if not needed.
* regcharclass.pl: Make 2 locals into global hashes (Karl Williamson, 2021-08-07, 1 file, -6/+8)
| This is in preparation for a future commit
* regcharclass.pl: Improve generated code for EBCDIC (Karl Williamson, 2021-08-07, 1 file, -32/+101)
| UTF-8 has some desirable characteristics not shared by UTF-EBCDIC. One example is that all the continuation bytes are in a single range. By transforming a UTF-EBCDIC byte into I8 (similar to UTF-8), we gain those characteristics, and may be able to save a conditional or three. This commit creates a 2nd pass over the bytes that are to be matched, transforming them into I8. If that pass results in fewer conditionals than the traditional, native, generated code, use the result with fewer. This saves quite a bit in some of the generated code, enabling the quotemeta macro to be represented in a single part; previously it had to be split to avoid compiler macro size limits.
* regcharclass.pl: White-space comment only (Karl Williamson, 2021-08-07, 1 file, -15/+19)
| A future commit will put a block around this; indent now.
* regcharclass.pl: Get UTF EBCDIC translations (Karl Williamson, 2021-08-07, 1 file, -2/+12)
| These will be used in a future commit
* regcharclass.pl: Add ability to avoid wrong mnemonic (Karl Williamson, 2021-08-07, 1 file, -1/+2)
| A future commit will pass this function data that shouldn't be translated into a mnemonic, like 'f' for the letter f. The reason is that that code will potentially be executed on a machine with a different character set than what the mnemonic would be valid for.
* regcharclass.pl: Change variable name (Karl Williamson, 2021-08-07, 1 file, -6/+6)
| A future commit will use this differently than the current name implies
* regcharclass.pl: Reorder execution path (Karl Williamson, 2021-08-07, 1 file, -17/+16)
| This moves a loop earlier in the execution path. This will be useful in a later commit
* regcharclass.pl: Rmv unused variable (Karl Williamson, 2021-08-07, 1 file, -2/+1)
|
* regcharclass.pl: Add an error check (Karl Williamson, 2021-08-07, 1 file, -1/+5)
|
* regcharclass.pl: Move some code earlier (Karl Williamson, 2021-08-07, 1 file, -39/+41)
| We can short circuit some work by moving the test earlier. This does not change the generated file.
* regcharclass.pl: Rmv unused variable (Karl Williamson, 2021-08-07, 1 file, -2/+0)
|
* regen/regcharclass.pl: Use deref of an array (Karl Williamson, 2021-08-07, 1 file, -5/+6)
| This will make future commits read better.
* regcharclass.h: Remove 2 EBCDIC dependencies (Karl Williamson, 2021-07-31, 1 file, -1/+6)
| This commit makes is_HANGUL_ED_utf8_safe() return 0 unconditionally on EBCDIC platforms. This means its callers don't have to care what platform is running. Change the two callers to take advantage of this. The commit also changes the description of the macro to be slightly more accurate.
* regcharclass.h: #defines for non-chars by UTF8 length (Karl Williamson, 2021-07-30, 1 file, -0/+48)
| This creates macros for the non-character code points so that, given the length of the UTF-8 sequence, only the ones that have that length match. This makes for more efficient processing, to be used in a future commit. The place where the length changes depends on the platform type, and these macros will keep the code from having to worry about that.
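As an illustration of a length-specific test, the sketch below covers only the non-characters that occupy three bytes on an ASCII platform: U+FDD0..U+FDEF (EF B7 90 through EF B7 AF) plus U+FFFE and U+FFFF (EF BF BE and EF BF BF). The function name is hypothetical, the byte values hold for UTF-8 only, and the caller is assumed to have already established that the sequence is three bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, ASCII platforms only: 's' points at a UTF-8 sequence
 * already known to be exactly 3 bytes long.  Returns 1 if it encodes one of
 * the 3-byte non-characters, else 0. */
static int is_3byte_nonchar_sketch(const unsigned char *s)
{
    assert(s != NULL);
    return s[0] == 0xEF
        && (   (s[1] == 0xB7 && s[2] >= 0x90 && s[2] <= 0xAF)  /* FDD0..FDEF */
            || (s[1] == 0xBF && s[2] >= 0xBE && s[2] <= 0xBF)); /* FFFE, FFFF */
}
```

Because the length is known up front, the 4-byte non-characters (the last two code points of planes 1 through 16) never have to be considered on this path.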
* style: Detabify regen files. (Michael G. Schwern, 2021-01-17, 1 file, -4/+4)
| They generate C files. Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl. I can't get it to easily ignore whitespace; `git diff --name-only` does not respect the -w flag. regen_perly.pl is left alone. That would require rebuilding perly.*, which is beyond a simple indentation change.
* regen/regcharclass.pl: Mark intermediate macros as internal (Karl Williamson, 2020-12-21, 1 file, -4/+4)
| The macros generated by this script may have to be split into sub-macros to make the overall macro fit the maximum number of characters allowed by the compiler for a macro definition. This commit adds a trailing underscore to the names of such intermediate macros so as to mark them as non-API for autodoc.
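The splitting scheme can be pictured with a toy example. All macro names below are made up for illustration and are not taken from regcharclass.h; the only point is that the pieces carry a trailing underscore while the public macro that combines them does not.

```c
#include <assert.h>

/* Hypothetical intermediate pieces: the trailing underscore marks them as
 * internal, so documentation tooling can skip them. */
#define is_EXAMPLE_cp_part0_(cp) ((cp) == 0x0A || (cp) == 0x0D)
#define is_EXAMPLE_cp_part1_(cp) ((cp) == 0x85 || (cp) == 0x2028)

/* Hypothetical public macro: each piece stays short enough for compilers
 * that limit the size of a single macro definition. */
#define is_EXAMPLE_cp(cp) \
    (is_EXAMPLE_cp_part0_(cp) || is_EXAMPLE_cp_part1_(cp))
```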
* regcharclass.pl: Get code point folding to a seq (Karl Williamson, 2020-12-19, 1 file, -17/+30)
| Previously regcharclass.pl could tell if an input string was a multi-character fold of some Unicode code point. This commit adds the ability to return what that code point is. This capability will be used in a later commit.
* regen/regcharclass.pl: Rmv special case (Karl Williamson, 2020-12-19, 1 file, -15/+1)
| This avoided checking for optimizations. Whatever its original use, it doesn't do any good, and the optimizations are actually useful.
* regen/regcharclass.pl: Use smaller inRANGE version (Karl Williamson, 2020-12-06, 1 file, -2/+2)
| The previous commit split inRANGE up so that code that was known to have valid inputs to it could use a component that didn't have all the compile-time checks (often duplicates) that otherwise are made. This commit changes to use that component. The reason the compile-time checks are unnecessary here is that this is machine-generated code known to meet the inRANGE input requirements. All those compile-time checks added up to being too large for some compilers to handle.
* regcharclass.h: Simplify some expressions (Karl Williamson, 2020-11-22, 1 file, -8/+9)
| The regen script was improperly collapsing two-element ranges into two separate elements, which caused extraneous code to be generated.
* regen/regcharclass.pl: White space only (Karl Williamson, 2020-10-16, 1 file, -38/+72)
| This does some line wrapping, etc
* regen/regcharclass.pl: Rmv unused macro (Karl Williamson, 2020-10-16, 1 file, -1/+1)
|
* regen/regcharclass.pl: Use char instead of hex (Karl Williamson, 2020-10-16, 1 file, -1/+31)
| This changes the generated macros to use a printable character or mnemonic instead of a hex value. This makes the macros easier to read.
* regen/regcharclass.pl: Move parameter to caller (Karl Williamson, 2020-10-16, 1 file, -3/+7)
| This commit changes a sub in this file to be passed a new parameter. This is in preparation for the value to be used in the caller. No need to derive it twice.
* regen/regcharclass.pl: Change member to method (Karl Williamson, 2020-10-16, 1 file, -26/+39)
| This will give future commits the flexibility to use a format based on the input value instead of a static one. The only non-white-space change from this commit is the reordering of a couple of tests; I'm not sure why that happened.
* regcharclass.h: Add some macros (Karl Williamson, 2019-11-16, 1 file, -5/+25)
| These macros will be used in a future commit, and are for three-character folds. regen/regcharclass*.pl are changed for this purpose.
* regen charclass_invlists.h (David Mitchell, 2019-10-03, 1 file, -1/+1)
| This was missed from the previous commit. Also, fix typo in regen/regcharclass.pl: it was still referring to itself as Porting/regcharclass.pl
* regcharclass.h: Change to use new inRANGE macro (Karl Williamson, 2019-03-30, 1 file, -93/+35)
| This was done by changing regen/regcharclass.pl. This results in half the conditionals being needed, and in some cases better error checking.
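One common way a range test ends up with half the conditionals is the unsigned-subtraction idiom: `lo <= c && c <= hi` becomes a single comparison. The macro below is a sketch of that idiom under a made-up name; it is not claimed to be the definition of Perl's inRANGE, which may differ and carries additional checks.

```c
#include <assert.h>

/* Hypothetical stand-in for a range macro.  If c < lo, the subtraction wraps
 * around to a huge unsigned value, so one comparison covers both bounds.
 * Assumes lo <= hi and that all three values fit in an int. */
#define IN_RANGE_SKETCH(c, lo, hi) \
    ((unsigned)((c) - (lo)) <= (unsigned)((hi) - (lo)))
```

For instance, testing whether a byte is a UTF-8 continuation byte on an ASCII platform becomes `IN_RANGE_SKETCH(b, 0x80, 0xBF)`.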
* regen/regcharclass.pl: Remove obsolete macro (Karl Williamson, 2019-02-05, 1 file, -4/+0)
| This has been replaced by regen/unicode_constants.pl some releases ago.
* fix typos (Alexandr Savca, 2018-10-09, 1 file, -1/+1)
| Committer: For porting tests: Update $VERSION in 4 files. Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
* Make isC9_STRICT_UTF8_CHAR() an inline dfa (Karl Williamson, 2018-07-05, 1 file, -47/+0)
| This replaces a complicated trie with a dfa. This should cut down the number of conditionals encountered in parsing many code points.
* Make isSTRICT_UTF8_CHAR() an inline function (Karl Williamson, 2018-07-05, 1 file, -21/+0)
| It was a macro that used a trie. This changes to use the dfa constructed in previous commits. I didn't bother with taking measurements. A dfa should have fewer conditionals for many code points.
* Use strict dfa to translate from UTF-8 to code point (Karl Williamson, 2018-07-05, 1 file, -0/+4)
| With this commit, if a sequence passes the dfa, the result can be returned immediately. Previously some rare potentially problematic sequences could pass, which would then need further checking, which then had to be done always. So this speeds up the general case.
* Make isUTF8_char() an inline function (Karl Williamson, 2018-07-05, 1 file, -9/+0)
| It was a macro that used a trie. This changes to use the dfa constructed in previous commits. I didn't bother with taking measurements. A dfa should require fewer conditionals to be executed for many code points.
* Use new paradigm for hdr file double inclusion guard (Karl Williamson, 2017-06-02, 1 file, -2/+2)
| We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define. See https://rt.perl.org/Ticket/Display.html?id=131110 There are many more header files which are not guarded.
* fixup regen/regcharclass.pl for no '.' in @INC (David Mitchell, 2017-04-07, 1 file, -1/+1)
| Note that this isn't normally executed during build, so it wasn't spotted earlier.
* Switch most open() calls to three-argument form. (John Lightsey, 2016-12-23, 1 file, -1/+1)
| Switch from two-argument form. Filehandle cloning is still done with the two-argument form for backward compatibility. Committer: Get all porting tests to pass. Increment some $VERSIONs. Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl For: RT #130122
* regen/regcharclass.pl: Add const cast (Karl Williamson, 2016-12-11, 1 file, -1/+1)
| This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0, which manually added const qualifiers to some generated code in order to avoid some compiler warnings. This changes the code generator to use the same 'const' qualifier generally. The code changed by the other commit had been hand-edited after being generated to add branch prediction, which would be too hard to program in at this time, so the const additions also had to be hand-edited in.
* Patch unit tests to explicitly insert "." into @INC when needed. (H.Merijn Brand, 2016-11-11, 1 file, -2/+2)
| require calls now require ./ to be prepended to the file since . is no longer guaranteed to be in @INC.
* Add macro for Unicode Corrigendum #9 strict (Karl Williamson, 2016-09-17, 1 file, -0/+10)
| This macro follows Unicode Corrigendum #9 to allow non-character code points. These are still discouraged but not completely forbidden. It's best for code that isn't intended to operate on arbitrary other code text to use the original definition, but code that does things, such as source code control, should change to use this definition if it wants to be Unicode-strict. Perl can't adopt C9 wholesale, as it might create security holes in existing applications that rely on Perl keeping non-chars out.
* Add macro for determining if UTF-8 is Unicode-strict (Karl Williamson, 2016-09-17, 1 file, -0/+43)
|
* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII (Karl Williamson, 2016-09-17, 1 file, -18/+12)
| This changes the macro isUTF8_CHAR to have the same number of code points built-in for EBCDIC as ASCII. This obsoletes the IS_UTF8_CHAR_FAST macro, which is removed. Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand copied into utf8.h, and LIKELY's manually added, then the generating code was commented out. Now this has been done with EBCDIC platforms as well. This makes regenerating regcharclass.h faster. The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting the number of #ifdef's down, and the comments about it are changed somewhat.
* regen/regcharclass.pl: surrogates are code points (Karl Williamson, 2016-09-17, 1 file, -1/+1)
| They are not "characters"
* Make 3 UTF-8 macros API (Karl Williamson, 2016-08-31, 1 file, -2/+2)
| These may be useful to various module writers. They certainly are useful for Encode. This makes public API macros to determine if the input UTF-8 represents (one macro for each category): a) a surrogate code point; b) a non-character code point; c) a code point that is above Unicode's legal maximum. The macros are machine generated. In making them public, I am now using the string end location parameter to guard against running off the end of the input. Previously this parameter was ignored, as their use in the core could be tightly controlled so that we already knew that the string was long enough when calling these macros. But this can't be guaranteed in the public API. An optimizing compiler should be able to remove redundant length checks.
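The role of the end-of-string parameter can be sketched for category c), code points above U+10FFFF. On ASCII platforms U+10FFFF encodes as F4 8F BF BF, so a sequence is above the maximum when its first byte is F4 and the second is 0x90 or higher, or when the first byte is F5 or higher. The function name below is hypothetical, the byte values assume UTF-8 rather than UTF-EBCDIC, and the input is assumed to be otherwise well-formed; what matters is that every byte is bounds-checked against `e` before it is read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, ASCII platforms only: returns 1 if the sequence at
 * 's' starts a code point above U+10FFFF, reading no byte at or past 'e'. */
static int is_above_unicode_utf8_safe_sketch(const unsigned char *s,
                                             const unsigned char *e)
{
    assert(s <= e);
    if (e - s < 1)
        return 0;                       /* empty input: nothing to match */
    if (s[0] >= 0xF5)
        return 1;                       /* start byte alone is decisive */
    if (s[0] == 0xF4)                   /* need the second byte to decide */
        return (e - s) >= 2 && s[1] >= 0x90 && s[1] <= 0xBF;
    return 0;
}
```

When a caller has already verified the length, a compiler can often fold the `e - s` comparisons away, which matches the commit's remark about redundant length checks.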
* regen/regcharclass.pl: Work on early Unicodes (Karl Williamson, 2015-07-28, 1 file, -3/+3)
| For properties that aren't defined in all Unicode versions, this just changes to use synonyms that are defined in all of them.