isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII

This changes the macro isUTF8_CHAR so that it has the same number of code points built in for EBCDIC as for ASCII. That makes the IS_UTF8_CHAR_FAST macro obsolete, so it is removed.

Previously, the code generated by regen/regcharclass.pl for ASCII platforms was copied by hand into utf8.h, LIKELY's were added manually, and the generating code was then commented out. The same has now been done for EBCDIC platforms, which makes regenerating regcharclass.h faster.

This commit also moves the copied macro in utf8.h into the main code section for non-EBCDIC compiles, which cuts down the number of #ifdef's, and changes the comments about it somewhat.
author: Karl Williamson <khw@cpan.org> 2016-09-03 14:12:27 -0600
committer: Karl Williamson <khw@cpan.org> 2016-09-17 17:22:25 -0600
commit: 784d4f31222f1bf7421b1aab87276f4878d60363 (patch)
tree: a5e9004f9729bc99c19ccc866f834e83bcc8a3fc /regen
parent: 21cb232c014d719d883ed0f8d6185dc36037859e (diff)
download: perl-784d4f31222f1bf7421b1aab87276f4878d60363.tar.gz
1 files changed, 12 insertions, 18 deletions
diff --git a/regen/regcharclass.pl b/regen/regcharclass.pl
index de4d3a3047..bd677acd15 100755
--- a/regen/regcharclass.pl
+++ b/regen/regcharclass.pl
@@ -1637,35 +1637,29 @@ SURROGATE: Surrogate code points
 => UTF8 :safe
 \p{_Perl_Surrogate}
 
-# This program was run with this enabled, and the results copied to utf8.h;
-# then this was commented out because it takes so long to figure out these 2
-# million code points.  The results would not change unless utf8.h decides it
-# wants a maximum other than 4 bytes, or this program creates better
+# This program was run with this enabled, and the results copied to utf8.h and
+# utfebcdic.h; then this was commented out because it takes so long to figure
+# out these 2 million code points.  The results would not change unless utf8.h
+# decides it wants a different maximum, or this program creates better
 # optimizations.  Trying with 5 bytes used too much memory to calculate.
 #
 # We don't generate code for invariants here because the EBCDIC form is too
 # complicated and would slow things down; instead the user should test for
 # invariants first.
 #
-# NOTE: The number of bytes generated here must match the value in
-# IS_UTF8_CHAR_FAST in utf8.h
+# 0x1FFFFF was chosen because for both UTF-8 and UTF-EBCDIC, its start byte
+# is the same as 0x10FFFF, and it includes all the above-Unicode code points
+# that have that start byte.  In other words, it is the natural stopping place
+# that includes all Unicode code points.
 #
-#UTF8_CHAR: Matches legal UTF-8 encoded characters from 2 through 4 bytes
+#UTF8_CHAR: Matches legal UTF-8 variant code points up through the 0x1FFFFFF
 #=> UTF8 :no_length_checks only_ascii_platform
 #0x80 - 0x1FFFFF
 
-# This hasn't been commented out, but the number of bytes it works on has been
-# cut down to 3, so it doesn't cover the full legal Unicode range.  Making it
-# 5 bytes would cover beyond the full range, but takes quite a bit of time and
-# memory to calculate.  The generated table varies depending on the EBCDIC
-# code page.
+#UTF8_CHAR: Matches legal UTF-EBCDIC variant code points up through 0x1FFFFFF
+#=> UTF8 :no_length_checks only_ebcdic_platform
+#0xA0 - 0x1FFFFF
 
-# NOTE: The number of bytes generated here must match the value in
-# IS_UTF8_CHAR_FAST in utf8.h
-#
-UTF8_CHAR: Matches legal UTF-EBCDIC encoded characters from 2 through 3 bytes
-=> UTF8 :no_length_checks only_ebcdic_platform
-0xA0 - 0x3FFF
 
 QUOTEMETA: Meta-characters that \Q should quote
 => high :fast
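
The choice of 0x1FFFFF as the upper bound in the new comment can be checked for the UTF-8 (ASCII platform) case with a small sketch. The helpers utf8_len() and utf8_start_byte() below are hypothetical names used only for illustration; they are not part of utf8.h, and they assume the extended UTF-8 scheme with sequences of up to 6 bytes for code points up to 0x7FFFFFFF. UTF-EBCDIC is not modeled here. On UTF-8, 0x10FFFF and 0x1FFFFF both need 4 bytes and 0x1FFFFF is the largest 4-byte value, although the lead bytes themselves differ (0xF4 and 0xF7), so "same start byte" is best read there as "same sequence length".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from utf8.h): number of bytes needed to encode
 * cp in extended UTF-8.  Assumes cp <= 0x7FFFFFFF. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;   /* invariant (ASCII) range */
    if (cp < 0x800)     return 2;   /* 5 + 6 payload bits */
    if (cp < 0x10000)   return 3;   /* 4 + 6 + 6 payload bits */
    if (cp < 0x200000)  return 4;   /* 3 + 6 + 6 + 6 = 21 payload bits */
    if (cp < 0x4000000) return 5;
    return 6;
}

/* Hypothetical helper: the first (start) byte of the encoding of cp.
 * Assumes cp <= 0x7FFFFFFF. */
uint8_t utf8_start_byte(uint32_t cp)
{
    switch (utf8_len(cp)) {
    case 1:  return (uint8_t) cp;
    case 2:  return (uint8_t) (0xC0 | (cp >> 6));
    case 3:  return (uint8_t) (0xE0 | (cp >> 12));
    case 4:  return (uint8_t) (0xF0 | (cp >> 18));
    case 5:  return (uint8_t) (0xF8 | (cp >> 24));
    default: return (uint8_t) (0xFC | (cp >> 30));
    }
}

/* Checks the claim for UTF-8: the range 0x80 - 0x1FFFFF stops exactly at
 * the last code point that still fits in 4 bytes. */
void check_upper_bound(void)
{
    assert(utf8_len(0x10FFFF) == 4);   /* highest Unicode code point */
    assert(utf8_len(0x1FFFFF) == 4);   /* last 4-byte code point */
    assert(utf8_len(0x200000) == 5);   /* one past the chosen bound */
}
```

Calling check_upper_bound() from a test program exercises the three boundary cases. The sketch only illustrates why the range ends where it does; the real isUTF8_CHAR macro is generated by regen/regcharclass.pl and does not compute lengths this way.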