author Karl Williamson <public@khwilliamson.com> 2012-09-05 20:56:09 -0600
committer Karl Williamson <public@khwilliamson.com> 2012-09-13 21:14:04 -0600
commit 4d6461409e812aecb1fa745debb6132ce8e5612d
tree 233a2c093d46c73bc151240415219e0e7ed41b11 /regen
parent ae1d4929d23a3d6949518058aa41cd90a700a4af
utf8.h: Use machine generated IS_UTF8_CHAR()
This takes the output of regen/regcharclass.pl for all the 1-4 byte UTF-8 representations of Unicode code points, and replaces the current hand-rolled definition there. It does this only for ASCII platforms, leaving EBCDIC to be machine generated when run on such a platform. I would rather have both versions regenerated each time they are needed, to avoid an EBCDIC dependency, but it takes more than 10 minutes on my computer to process the roughly 2 million code points (0x0 - 0x1FFFFF) that have to be checked on ASCII platforms, and currently t/porting/regen.t runs this program every time; that slowdown would be unacceptable. If this is ever run under EBCDIC, the macro will be machine computed (very slowly). So, even though there is an EBCDIC dependency, it has essentially been solved.
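For readers who want a feel for what a 1-4 byte check over 0x0 - 0x1FFFFF involves, here is a minimal hand-written sketch in C. It is not the macro that regen/regcharclass.pl generates and not the text that went into utf8.h; the function names are hypothetical. It assumes the range is meant without exclusions, so surrogates and values above 0x10FFFF are accepted, and it rejects overlong encodings on the assumption that the generated macro matches only the encoding of each code point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT the generated IS_UTF8_CHAR() macro.
 * Returns the length (1..4) of the UTF-8 encoded character that starts
 * at s, or 0 if the first bytes are not a well-formed encoding of a
 * code point in 0x0 - 0x1FFFFF.  'avail' is the number of readable bytes. */
size_t utf8_char_len_sketch(const unsigned char *s, size_t avail)
{
    size_t len, i;
    unsigned long cp;   /* code point accumulated so far */
    unsigned long min;  /* smallest code point that needs 'len' bytes */

    if (avail == 0)
        return 0;

    if (s[0] < 0x80) {
        return 1;                                   /* 0xxxxxxx */
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F; min = 0x80;      /* 110xxxxx */
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F; min = 0x800;     /* 1110xxxx */
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07; min = 0x10000;   /* 11110xxx */
    } else {
        return 0;  /* stray continuation byte, or a 5+ byte start byte */
    }

    if (avail < len)
        return 0;  /* sequence is truncated */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                               /* not 10xxxxxx */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    /* Reject overlong forms: the value must really need 'len' bytes. */
    return cp >= min ? len : 0;
}

/* Small usage example; the byte strings are illustrative only. */
void utf8_char_len_selfcheck(void)
{
    assert(utf8_char_len_sketch((const unsigned char *)"A", 1) == 1);
    assert(utf8_char_len_sketch((const unsigned char *)"\xF7\xBF\xBF\xBF", 4) == 4);
    assert(utf8_char_len_sketch((const unsigned char *)"\xC0\x80", 2) == 0);
}
```

The largest 4-byte sequence, F7 BF BF BF, decodes to 0x1FFFFF, which is why a 4-byte maximum corresponds to the upper bound of the range. The sketch decodes and compares at run time; a generated macro would presumably expand to direct byte-range comparisons instead.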
Diffstat (limited to 'regen')
-rwxr-xr-x regen/regcharclass.pl 16
1 files changed, 16 insertions, 0 deletions
diff --git a/regen/regcharclass.pl b/regen/regcharclass.pl
index c4f5951a3c..7d126428ef 100755
--- a/regen/regcharclass.pl
+++ b/regen/regcharclass.pl
@@ -1161,6 +1161,22 @@ GCB_V: Grapheme_Cluster_Break=V
=> UTF8 :fast
\p{_X_GCB_V}
+# This program was run with this enabled, and the results copied to utf8.h;
+# then this was commented out because it takes so long to figure out these 2
+# million code points. The results would not change unless utf8.h decides it
+# wants a maximum other than 4 bytes, or this program creates better
+# optimizations
+#UTF8_CHAR: Matches utf8 from 1 to 4 bytes
+#=> UTF8 :safe only_ascii_platform
+#0x0 - 0x1FFFFF
+
+# This hasn't been commented out, because we haven't an EBCDIC platform to run
+# it on, and the 3 types of EBCDIC allegedly supported by Perl would have
+# different results
+UTF8_CHAR: Matches utf8 from 1 to 5 bytes
+=> UTF8 :safe only_ebcdic_platform
+0x0 - 0x3FFFFF:
+
QUOTEMETA: Meta-characters that \Q should quote
=> high :fast
\p{_Perl_Quotemeta}