author Karl Williamson <khw@cpan.org> 2018-01-03 20:41:29 -0700
committer Karl Williamson <khw@cpan.org> 2018-01-30 22:22:07 -0700
commit fab19207eff5a75200691ff8ca26814b58e1a099
tree d323ee117c21585095c8ee0a77a7c1f49fa5063e /mg.c
parent 61520eeebbc6694eb254adea13b7ace1d5519567
Add check that "$!" is correctly interpreted as UTF-8
We sometimes need to know whether an error message is UTF-8. Previously we checked that it is syntactically valid UTF-8 and that the LC_MESSAGES locale is UTF-8. But some systems, notably Windows, do not have LC_MESSAGES. For those, this commit adds a different, semantic check: that the text of the message, when interpreted as UTF-8, is all in the same Unicode script. Unlike the LC_MESSAGES check, this is not foolproof, but it is better than what we have now for such systems. It is likely foolproof for non-Latin locales, as any message will contain many characters in that locale's script and no ASCII Latin letters. For a Latin locale, ASCII letters could be intermixed with the UTF-8 ones, causing potential ambiguity.
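The idea behind the semantic check can be illustrated with a small standalone sketch. This is not the code in mg.c and does not use Perl's internal API. The names `script_of`, `decode_utf8` and `looks_like_single_script_utf8` are hypothetical, and the script table is a deliberately tiny assumption (Latin, Greek, Cyrillic, with everything else lumped into one class), whereas the real check relies on full Unicode script data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified script classes (an assumption for
 * illustration; real script data covers far more of Unicode). */
enum script { SCR_COMMON, SCR_LATIN, SCR_GREEK, SCR_CYRILLIC, SCR_OTHER };

static enum script script_of(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return SCR_LATIN;
    if (cp < 0xC0) return SCR_COMMON;        /* digits, punctuation, spaces (approximation) */
    if (cp <= 0x024F && cp != 0xD7 && cp != 0xF7) return SCR_LATIN;
    if (cp == 0xD7 || cp == 0xF7) return SCR_COMMON;   /* multiplication, division signs */
    if (cp >= 0x0370 && cp <= 0x03FF) return SCR_GREEK;
    if (cp >= 0x0400 && cp <= 0x04FF) return SCR_CYRILLIC;
    return SCR_OTHER;                        /* limitation: all other scripts share one class */
}

/* Decode one UTF-8 sequence from [p, end), with p < end.
 * Returns the number of bytes consumed, or 0 if the sequence is malformed. */
static size_t decode_utf8(const unsigned char *p, const unsigned char *end,
                          uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;
    unsigned char b = *p;

    if (b < 0x80)                { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; min = 0x10000; }
    else return 0;                           /* stray continuation or invalid lead byte */

    if ((size_t)(end - p) < len) return 0;   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Returns 1 if the n bytes at s are valid UTF-8, contain at least one
 * non-ASCII character, and all script-specific characters belong to one
 * script class; returns 0 otherwise. */
int looks_like_single_script_utf8(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *end = p + n;
    enum script seen = SCR_COMMON;
    int has_high = 0;

    while (p < end) {
        uint32_t cp;
        enum script sc;
        size_t len = decode_utf8(p, end, &cp);

        if (len == 0) return 0;              /* not syntactically valid UTF-8 */
        if (cp >= 0x80) has_high = 1;
        sc = script_of(cp);
        if (sc != SCR_COMMON) {
            if (seen == SCR_COMMON) seen = sc;
            else if (seen != sc) return 0;   /* scripts are mixed */
        }
        p += len;
    }
    return has_high;                         /* pure ASCII never needs the flag */
}
```

In this sketch a Cyrillic message encoded as UTF-8 passes, while the same bytes followed by an ASCII letter fail, which mirrors the commit message's point that non-Latin locales give the strongest signal. A Latin-script message such as "café" also passes, but there the evidence is weaker, because ASCII letters and accented letters share a script.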
Diffstat (limited to 'mg.c')
-rw-r--r-- mg.c 12
1 files changed, 9 insertions, 3 deletions
diff --git a/mg.c b/mg.c
index c6e68d6f2d..68553275de 100644
--- a/mg.c
+++ b/mg.c
@@ -818,9 +818,9 @@ S_fixup_errno_string(pTHX_ SV* sv)
* avoid as many possible backward compatibility issues as possible, we
* don't turn on the flag unless we have to. So the flag stays off for
* an entirely invariant string. We assume that if the string looks
- * like UTF-8, it really is UTF-8: "text in any other encoding that
- * uses bytes with the high bit set is extremely unlikely to pass a
- * UTF-8 validity test"
+ * like UTF-8 in a single script, it really is UTF-8: "text in any
+ * other encoding that uses bytes with the high bit set is extremely
+ * unlikely to pass a UTF-8 validity test"
* (http://en.wikipedia.org/wiki/Charset_detection). There is a
* potential that we will get it wrong however, especially on short
* error message text, so do an additional check. */
@@ -831,6 +831,12 @@ S_fixup_errno_string(pTHX_ SV* sv)
&& _is_cur_LC_category_utf8(LC_MESSAGES)
+#else /* If can't check directly, at least can see if script is consistent,
+ under UTF-8, which gives us an extra measure of confidence. */
+
+ && isSCRIPT_RUN((const U8 *) SvPVX_const(sv), (U8 *) SvEND(sv),
+ TRUE, /* Means assume UTF-8 */
+ NULL)
#endif
) {