Add check that "$!" is correctly interpreted as UTF-8

We sometimes need to know if an error message is UTF-8 or not. Previously we checked that it is syntactically valid UTF-8, and that the LC_MESSAGES locale is UTF-8. But some systems, notably Windows, do not have LC_MESSAGES.

For those, this commit adds a different, semantic, check that the text of the message when interpreted as UTF-8 is all in the same Unicode script.

This is not foolproof, unlike the LC_MESSAGES check, but it's better than what we have now for such systems. It likely is foolproof for non-Latin locales, as any message will have a bunch of characters in that locale, and no ASCII Latin ones. For a Latin locale, these ASCII letters could be intermixed with the UTF-8 ones, causing potential ambiguity.
author: Karl Williamson <khw@cpan.org> 2018-01-03 20:41:29 -0700
committer: Karl Williamson <khw@cpan.org> 2018-01-30 22:22:07 -0700
commit: fab19207eff5a75200691ff8ca26814b58e1a099 (patch)
tree: d323ee117c21585095c8ee0a77a7c1f49fa5063e /mg.c
parent: 61520eeebbc6694eb254adea13b7ace1d5519567 (diff)
download: perl-fab19207eff5a75200691ff8ca26814b58e1a099.tar.gz
1 files changed, 9 insertions, 3 deletions
diff --git a/mg.c b/mg.c
index c6e68d6f2d..68553275de 100644
--- a/mg.c
+++ b/mg.c
@@ -818,9 +818,9 @@ S_fixup_errno_string(pTHX_ SV* sv)
          * avoid as many possible backward compatibility issues as possible, we
          * don't turn on the flag unless we have to.  So the flag stays off for
          * an entirely invariant string.  We assume that if the string looks
-         * like UTF-8, it really is UTF-8:  "text in any other encoding that
-         * uses bytes with the high bit set is extremely unlikely to pass a
-         * UTF-8 validity test"
+         * like UTF-8 in a single script, it really is UTF-8:  "text in any
+         * other encoding that uses bytes with the high bit set is extremely
+         * unlikely to pass a UTF-8 validity test"
          * (http://en.wikipedia.org/wiki/Charset_detection).  There is a
          * potential that we will get it wrong however, especially on short
          * error message text, so do an additional check. */
@@ -831,6 +831,12 @@ S_fixup_errno_string(pTHX_ SV* sv)
 
             &&  _is_cur_LC_category_utf8(LC_MESSAGES)
 
+#else   /* If can't check directly, at least can see if script is consistent,
+           under UTF-8, which gives us an extra measure of confidence. */
+
+            && isSCRIPT_RUN((const U8 *) SvPVX_const(sv), (U8 *) SvEND(sv),
+                            TRUE, /* Means assume UTF-8 */
+                            NULL)
 #endif
 
         ) {
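
The idea behind the fallback check can be illustrated outside of perl with a minimal standalone sketch. This is not the implementation of isSCRIPT_RUN. The names script_of, decode_utf8 and looks_like_single_script_utf8 are hypothetical, and the script table is a crude assumption that knows only Latin, Greek and Cyrillic and lumps everything else into one bucket. It shows the two steps the commit message describes: the bytes must decode as well-formed UTF-8, and every code point that is not script-neutral (digits, punctuation, spaces) must belong to the same script.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified script classes.  Real Unicode defines
 * many more scripts; SC_OTHER lumps all of them together here. */
enum script { SC_COMMON, SC_LATIN, SC_GREEK, SC_CYRILLIC, SC_OTHER };

/* Crude range-based classification (an assumption for illustration only). */
static enum script script_of(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return SC_LATIN;
    if (cp < 0xC0) return SC_COMMON;        /* digits, punctuation, symbols */
    if (cp == 0xD7 || cp == 0xF7) return SC_COMMON;  /* multiply, divide */
    if (cp <= 0x24F) return SC_LATIN;       /* Latin-1 letters, Extended-A/B */
    if (cp >= 0x370 && cp <= 0x3FF) return SC_GREEK;
    if (cp >= 0x400 && cp <= 0x52F) return SC_CYRILLIC;
    return SC_OTHER;
}

/* Decode one code point from [p, end).  Returns the number of bytes
 * consumed, or 0 if the sequence is malformed (bad lead or continuation
 * byte, truncated, overlong, surrogate, or above U+10FFFF). */
static size_t decode_utf8(const unsigned char *p, const unsigned char *end,
                          uint32_t *cp)
{
    size_t len, i;
    uint32_t min;

    if (p >= end) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = p[0] & 0x07; }
    else return 0;

    if ((size_t)(end - p) < len) return 0;
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Returns 1 if the n bytes at s are well-formed UTF-8 and all code points
 * that are not SC_COMMON share a single script class; otherwise 0. */
static int looks_like_single_script_utf8(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *end = p + n;
    enum script seen = SC_COMMON;   /* no script-specific character yet */

    while (p < end) {
        uint32_t cp;
        enum script sc;
        size_t len = decode_utf8(p, end, &cp);

        if (len == 0) return 0;     /* not valid UTF-8 at all */
        sc = script_of(cp);
        if (sc != SC_COMMON) {
            if (seen == SC_COMMON) seen = sc;   /* first real script seen */
            else if (sc != seen) return 0;      /* scripts are mixed */
        }
        p += len;
    }
    return 1;
}
```

Under these assumptions, a Cyrillic message encoded as UTF-8 passes, the same kind of message with an ASCII Latin letter mixed in fails, and a Latin-1 encoded "trouv\xE9" fails at the decoding step. A Latin-script message with accented letters passes, which is exactly the case the commit message flags as potentially ambiguous, since ASCII letters and multi-byte letters are both Latin.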