delta/perl.git - github.com: perl/perl5.git

	Commit message (Collapse)	Author	Age	Files	Lines
*	Add is_XPERLSPACE_utf8_safe_backwards()	Karl Williamson	2022-03-19	1	-0/+4
\| \| \| \| \|	This macro starts from the right side and matches UTF-8 white space characters.
*	regen/regcharclass.pl: Add backwards UTF-8 tries	Karl Williamson	2022-03-19	1	-21/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds the ability to generate a trie macro that starts at the right end of a string and backs up one matching byte at a time until a full character is matched, bailing immediately if a non-matching byte is found. Previously, the way to accomplish this was to call the function to hop back (which looked at the string byte by byte backwards until it found a non-continuation byte), and then look forwards for matching bytes. This new way is more efficient, as only the necessary bytes are examined.
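The right-to-left matching described in this commit message can be sketched in C roughly as follows. This is a hand-written illustration, not the macro that regen/regcharclass.pl generates: the function name `is_nbsp_or_ls_backwards` and the choice of just two code points (U+00A0 and U+2028) are assumptions made to keep the example short, and only UTF-8 on ASCII platforms is covered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Perl's API.  's' is the start of the
 * buffer and 'e' points one past its last byte.  Returns the byte length of
 * the character ending at 'e' if that character is U+00A0 (C2 A0) or U+2028
 * (E2 80 A8), else 0.  Bytes are examined from the right, and the function
 * gives up at the first byte that cannot be part of a match. */
size_t
is_nbsp_or_ls_backwards(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);

    if (e - s < 2) {
        return 0;               /* both candidates need at least 2 bytes */
    }

    switch (e[-1]) {            /* branch on the final byte first */
      case 0xA0:
        return (e[-2] == 0xC2) ? 2 : 0;
      case 0xA8:
        return (e - s >= 3 && e[-2] == 0x80 && e[-3] == 0xE2) ? 3 : 0;
      default:
        return 0;               /* non-matching byte: bail immediately */
    }
}
```

No hop back to the start of the character is needed: at most three bytes are read, and a mismatch on the last byte ends the test after a single comparison.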
*	utf8.c: Rmv an EBCDIC dependency	Karl Williamson	2021-08-14	1	-0/+2
\| \| \| \|	This is now generated by regcharclass.pl
*	regcharclass.pl: Add fast surrogate UTF-8 trie	Karl Williamson	2021-08-07	1	-1/+1
\| \| \| \| \|	This will be used in the next commit. It requires only the first two bytes to determine if a UTF-8 or UTF-EBCDIC sequence is for a surrogate
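As an illustration of why two bytes suffice, here is a minimal C sketch for UTF-8 on ASCII platforms, where every surrogate (U+D800..U+DFFF) is encoded as ED A0..BF followed by one more continuation byte. The function name `looks_like_utf8_surrogate` is hypothetical, the UTF-EBCDIC case is not covered, and the third byte is deliberately not validated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the generated Perl macro.  Returns 1 if the
 * bytes at 's' begin a UTF-8 encoded surrogate, judging only by the first
 * two bytes; 'e' points one past the end of the buffer. */
int
looks_like_utf8_surrogate(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);

    return e - s >= 2
        && s[0] == 0xED                   /* lead byte of U+D000..U+DFFF */
        && s[1] >= 0xA0 && s[1] <= 0xBF;  /* narrows it to U+D800..U+DFFF */
}
```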
*	regcharclass.pl: Further improve EBCDIC code	Karl Williamson	2021-08-07	1	-9/+25
\| \| \| \| \| \| \| \| \| \| \|	A commit a couple of commits ago improved the generated output of this script; this builds on that. The improvement was to try a transform that could lead to fewer conditionals, as bytes were grouped into fewer ranges. But that introduced a useless transformation for the single-element ranges that remain. This commit removes the transformation when it is not needed.
*	regcharclass.pl: Make 2 locals into global hashes	Karl Williamson	2021-08-07	1	-6/+8
\| \| \| \|	This is in preparation for a future commit
*	regcharclass.pl: Improve generated code for EBCDIC	Karl Williamson	2021-08-07	1	-32/+101
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	UTF-8 has some desirable characteristics not shared by UTF-EBCDIC. One example is all the continuation bytes are in a single range. By transforming a UTF-EBCDIC byte into I8 (similar to UTF-8), we gain those characteristics, and may be able to save a conditional or three. This commit creates a 2nd pass over the bytes that are to be matched, transforming them into I8. If that pass results in fewer conditionals than the traditional, native, generated code, use the fewer result. This saves quite a bit in some of the generated code, enabling the quotemeta macro to be represented in a single part; previously it had to be split to avoid compiler macro size limits.
*	regcharclass.pl: White-space comment only	Karl Williamson	2021-08-07	1	-15/+19
\| \| \| \|	A future commit will put a block around this; indent now.
*	regcharclass.pl: Get UTF EBCDIC translations	Karl Williamson	2021-08-07	1	-2/+12
\| \| \| \|	These will be used in a future commit
*	regcharclass.pl: Add ability to avoid wrong mnemonic	Karl Williamson	2021-08-07	1	-1/+2
\| \| \| \| \| \| \|	A future commit will pass this function data that shouldn't be translated into a mnemonic, like 'f' for the letter f. The reason is that that code will potentially be executed on a machine with a different character set than what the mnemonic would be valid for.
*	regcharclass.pl: Change variable name	Karl Williamson	2021-08-07	1	-6/+6
\| \| \| \|	A future commit will use this differently than the current name implies
*	regcharclass.pl: Reorder execution path	Karl Williamson	2021-08-07	1	-17/+16
\| \| \| \| \|	This moves a loop earlier in the execution path. This will be useful in a later commit
*	regcharclass.pl: Rmv unused variable	Karl Williamson	2021-08-07	1	-2/+1
\|
*	regcharclass.pl: Add an error check	Karl Williamson	2021-08-07	1	-1/+5
\|
*	regcharclass.pl: Move some code earlier	Karl Williamson	2021-08-07	1	-39/+41
\| \| \| \| \|	We can short circuit some work by moving the test earlier. This does not change the generated file.
*	regcharclass.pl: Rmv unused variable	Karl Williamson	2021-08-07	1	-2/+0
\|
*	regen/regcharclass.pl: Use deref of an array	Karl Williamson	2021-08-07	1	-5/+6
\| \| \| \|	This will make future commits read better.
*	regcharclass.h: Remove 2 EBCDIC dependencies	Karl Williamson	2021-07-31	1	-1/+6
\| \| \| \| \| \| \| \| \|	This commit makes is_HANGUL_ED_utf8_safe() return 0 unconditionally on EBCDIC platforms. This means its callers don't have to care what platform is running. Change the two callers to take advantage of this. The commit also changes the description of the macro to be slightly more accurate.
*	regcharclass.h: #defines for non-chars by UTF8 length	Karl Williamson	2021-07-30	1	-0/+48
\| \| \| \| \| \| \| \| \| \|	This creates macros for the non-character code points so that, given the length of the UTF-8 sequence, only those ones that have that length match. This makes for more efficient processing, to be used in a future commit. The place where the length changes depends on the platform type, and these macros will keep the code from having to worry about that.
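To show what a length-specific test can look like, the sketch below checks only the non-characters whose UTF-8 encoding is three bytes long on ASCII platforms: U+FDD0..U+FDEF (EF B7 90..AF) and U+FFFE..U+FFFF (EF BF BE..BF). The name `is_3byte_utf8_nonchar` is hypothetical and the code is not the generated macro; it assumes the caller has already established that the sequence is exactly three bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper.  's' must point at a 3-byte UTF-8 sequence; the
 * caller is assumed to have checked the length already, so no end pointer
 * is needed here. */
int
is_3byte_utf8_nonchar(const unsigned char *s)
{
    assert(s != NULL);

    return s[0] == 0xEF
        && (   (s[1] == 0xB7 && s[2] >= 0x90 && s[2] <= 0xAF)   /* U+FDD0..U+FDEF */
            || (s[1] == 0xBF && s[2] >= 0xBE && s[2] <= 0xBF)); /* U+FFFE, U+FFFF */
}
```

Because the four-byte non-characters (the last two code points of planes 1 through 16) never enter this test, a caller that already knows the length skips those comparisons entirely.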
*	style: Detabify regen files.	Michael G. Schwern	2021-01-17	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	They generate C files. Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl. I can't get it to easily ignore whitespace, `git diff --name-only` does not respect the -w flag. regen_perly.pl is left alone. That would require rebuilding perly.* which is beyond a simple indentation change.
*	regen/regcharclass.pl: Mark intermediate macros as internal	Karl Williamson	2020-12-21	1	-4/+4
\| \| \| \| \| \| \| \|	The macros generated by this script may have to be split into sub-macros to make the overall macro fit the maximum number of characters allowed by the compiler for a macro definition. This commit adds a trailing underscore to the names of such intermediate macros so as to mark them as non-API for autodoc.
*	regcharclass.pl: Get code point folding to a seq	Karl Williamson	2020-12-19	1	-17/+30
\| \| \| \| \| \| \|	Previously regcharclass.pl could tell if an input string was a multi-character fold of some Unicode code point. This commit adds the ability to return what that code point is. This capability will be used in a later commit.
*	regen/regcharclass.pl: Rmv special case	Karl Williamson	2020-12-19	1	-15/+1
\| \| \| \| \|	This avoided checking for optimizations. Whatever its original use, it doesn't do any good, and the optimizations are actually useful.
*	regen/regcharclass.pl: Use smaller inRANGE version	Karl Williamson	2020-12-06	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	The previous commit split inRANGE up so that code that was known to have valid inputs to it could use a component that didn't have all the compile-time checks (often duplicates) that otherwise are made. This commit changes to use that component. The reason the compile-time checks are unnecessary here, is this is machine-generated code known to meet the inRANGE input requirements. All those compile-time checks added up to being too large for some compilers to handle.
*	regcharclass.h: Simplify some expressions	Karl Williamson	2020-11-22	1	-8/+9
\| \| \| \| \|	The regen script was improperly collapsing two-element ranges into two separate elements, which caused extraneous code to be generated.
*	regen/regcharclass.pl: White space only	Karl Williamson	2020-10-16	1	-38/+72
\| \| \| \|	This does some line wrapping, etc
*	regen/regcharclass.pl: Rmv unused macro	Karl Williamson	2020-10-16	1	-1/+1
\|
*	regen/regcharclass.pl: Use char instead of hex	Karl Williamson	2020-10-16	1	-1/+31
\| \| \| \| \|	This changes the generated macros to use a printable character or mnemonic instead of a hex value. This makes the macros easier to read.
*	regen/regcharclass.pl: Move parameter to caller	Karl Williamson	2020-10-16	1	-3/+7
\| \| \| \| \| \|	This commit changes a sub in this file to be passed a new parameter. This is in preparation for the value to be used in the caller. No need to derive it twice.
*	regen/regcharclass.pl: Change member to method	Karl Williamson	2020-10-16	1	-26/+39
\| \| \| \| \| \| \| \|	This will give future commits more flexibility: instead of using a static format, they can use one based on the input value. The only non-white space change from this commit is the reordering of a couple of tests; I'm not sure why that happened.
*	regcharclass.h: Add some macros	Karl Williamson	2019-11-16	1	-5/+25
\| \| \| \| \| \|	These macros will be used in a future commit, and are for three-character folds. regen/regcharclass*.pl are changed for this purpose.
*	regen charclass_invlists.h	David Mitchell	2019-10-03	1	-1/+1
\| \| \| \| \| \| \|	This was missed from the previous commit. Also, fix a typo in regen/regcharclass.pl: it was still referring to itself as Porting/regcharclass.pl.
*	regcharclass.h: Change to use new inRANGE macro	Karl Williamson	2019-03-30	1	-93/+35
\| \| \| \| \|	This was done by changing regen/regcharclass.pl. This results in half the conditionals being needed, and in some cases better error checking.
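A common way to halve the conditionals in a range test is to subtract the lower bound and do a single unsigned comparison. The C sketch below shows that idea with hypothetical function names; it is assumed to capture the spirit of the change and is not Perl's actual inRANGE definition, which also performs compile-time checks on its arguments.

```c
#include <assert.h>

/* Conventional form: two comparisons per range. */
int
in_range_two_tests(unsigned c, unsigned lo, unsigned hi)
{
    return c >= lo && c <= hi;
}

/* Single-comparison form.  If c < lo, the unsigned subtraction wraps around
 * to a very large value, which is necessarily greater than (hi - lo), so
 * one test covers both bounds.  Requires lo <= hi. */
int
in_range_one_test(unsigned c, unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    return (c - lo) <= (hi - lo);
}
```

When `lo` and `hi` are constants, as they are in machine-generated code, `hi - lo` folds to a constant at compile time, leaving one subtraction and one comparison.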
*	regen/regcharclass.pl: Remove obsolete macro	Karl Williamson	2019-02-05	1	-4/+0
\| \| \| \|	This has been replaced by regen/unicode_constants.pl some releases ago.
*	fix typos	Alexandr Savca	2018-10-09	1	-1/+1
\| \| \| \| \| \| \| \|	Committer: For porting tests: Update $VERSION in 4 files. Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
*	Make isC9_STRICT_UTF8_CHAR() an inline dfa	Karl Williamson	2018-07-05	1	-47/+0
\| \| \| \| \|	This replaces a complicated trie with a dfa. This should cut down the number of conditionals encountered in parsing many code points.
*	Make isSTRICT_UTF8_CHAR() an inline function	Karl Williamson	2018-07-05	1	-21/+0
\| \| \| \| \| \| \|	It was a macro that used a trie. This changes to use the dfa constructed in previous commits. I didn't bother with taking measurements. A dfa should have fewer conditionals for many code points.
*	Use strict dfa to translate from UTF-8 to code point	Karl Williamson	2018-07-05	1	-0/+4
\| \| \| \| \| \| \|	With this commit, if a sequence passes the dfa, the result can be returned immediately. Previously some rare, potentially problematic sequences could pass, so further checking was needed, and it had to be done every time. So this speeds up the general case.
*	Make isUTF8_char() an inline function	Karl Williamson	2018-07-05	1	-9/+0
\| \| \| \| \| \| \|	It was a macro that used a trie. This changes to use the dfa constructed in previous commits. I didn't bother with taking measurements. A dfa should require fewer conditionals to be executed for many code points.
*	Use new paradigm for hdr file double inclusion guard	Karl Williamson	2017-06-02	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define. See https://rt.perl.org/Ticket/Display.html?id=131110 There are many more header files which are not guarded.
*	fixup regen/regcharclass.pl for no '.' in @INC	David Mitchell	2017-04-07	1	-1/+1
\| \| \| \| \|	Note that this isn't normally executed during build, so it wasn't spotted earlier.
*	Switch most open() calls to three-argument form.	John Lightsey	2016-12-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Switch from two-argument form. Filehandle cloning is still done with the two argument form for backward compatibility. Committer: Get all porting tests to pass. Increment some $VERSIONs. Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl For: RT #130122
*	regen/regcharclass.pl: Add const cast	Karl Williamson	2016-12-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0, which manually added const qualifiers to some generated code in order to avoid some compiler warnings. This changes the code generator to use the same 'const' qualifier generally. The code changed by the other commit had been hand-edited after being generated to add branch prediction, which would be too hard to program in at this time, so the const additions also had to be hand-edited in.
*	Patch unit tests to explicitly insert "." into @INC when needed.	H.Merijn Brand	2016-11-11	1	-2/+2
\| \| \| \| \|	require calls now require ./ to be prepended to the file since . is no longer guaranteed to be in @INC.
*	Add macro for Unicode Corrigendum #9 strict	Karl Williamson	2016-09-17	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	This macro follows Unicode Corrigendum #9 to allow non-character code points. These are still discouraged but not completely forbidden. It's best for code that isn't intended to operate on arbitrary other code text to use the original definition, but code that does things, such as source code control, should change to use this definition if it wants to be Unicode-strict. Perl can't adopt C9 wholesale, as it might create security holes in existing applications that rely on Perl keeping non-chars out.
*	Add macro for determining if UTF-8 is Unicode-strict	Karl Williamson	2016-09-17	1	-0/+43
\|
*	isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII	Karl Williamson	2016-09-17	1	-18/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This changes the macro isUTF8_CHAR to have the same number of code points built-in for EBCDIC as ASCII. This obsoletes the IS_UTF8_CHAR_FAST macro, which is removed. Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand copied into utf8.h, and LIKELY's manually added, then the generating code was commented out. Now this has been done with EBCDIC platforms as well. This makes regenerating regcharclass.h faster. The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting the number of #ifdef's down, and the comments about it are changed somewhat.
*	regen/regcharclass.pl: surrogates are code points	Karl Williamson	2016-09-17	1	-1/+1
\| \| \| \|	They are not "characters"
*	Make 3 UTF-8 macros API	Karl Williamson	2016-08-31	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These may be useful to various module writers. They certainly are useful for Encode. This makes public API macros to determine if the input UTF-8 represents (one macro for each category) a) a surrogate code point b) a non-character code point c) a code point that is above Unicode's legal maximum. The macros are machine generated. In making them public, I am now using the string end location parameter to guard against running off the end of the input. Previously this parameter was ignored, as their use in the core could be tightly controlled so that we already knew that the string was long enough when calling these macros. But this can't be guaranteed in the public API. An optimizing compiler should be able to remove redundant length checks.
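The role of the string end parameter can be sketched in C as follows, using the "above Unicode's legal maximum" category on an ASCII platform, where U+10FFFF is F4 8F BF BF. The name `is_above_unicode_safe` is hypothetical; unlike the generated macros, this sketch looks only at the leading bytes and does not validate the rest of the sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper.  Returns 1 if the bytes at 's' start a sequence for a
 * code point above U+10FFFF.  'e' points one past the end of the buffer and
 * is consulted before every read, so the function never runs off the end. */
int
is_above_unicode_safe(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);

    if (e - s < 1) {
        return 0;                 /* empty input: nothing to examine */
    }
    if (s[0] >= 0xF5) {
        return 1;                 /* lead bytes past F4 are all above Unicode */
    }
    if (s[0] == 0xF4) {
        /* F4 80..8F is still within Unicode; F4 90..BF is above it */
        return e - s >= 2 && s[1] >= 0x90 && s[1] <= 0xBF;
    }
    return 0;
}
```

When the caller has already checked the length, an optimizing compiler can often fold away the repeated `e - s` tests, which is the point made at the end of the commit message.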
*	regen/regcharclass.pl: Work on early Unicodes	Karl Williamson	2015-07-28	1	-3/+3
\| \| \| \| \|	For properties that aren't defined in all Unicode versions, this just changes to use synonyms that are defined in all of them.