author | Karl Williamson <khw@cpan.org> | 2018-01-03 20:41:29 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-01-30 22:22:07 -0700 |
commit | fab19207eff5a75200691ff8ca26814b58e1a099 (patch) | |
tree | d323ee117c21585095c8ee0a77a7c1f49fa5063e /mg.c | |
parent | 61520eeebbc6694eb254adea13b7ace1d5519567 (diff) | |
Add check that "$!" is correctly interpreted as UTF-8
We sometimes need to know if an error message is UTF-8 or not.
Previously we checked that it is syntactically valid UTF-8, and that the
LC_MESSAGES locale is UTF-8. But some systems, notably Windows, do not
have LC_MESSAGES. For those, this commit adds a different, semantic,
check that the text of the message when interpreted as UTF-8 is all in
the same Unicode script. This is not foolproof, unlike the LC_MESSAGES
check, but it's better than what we have now for such systems. It
likely is foolproof for non-Latin locales, as any message will have a
bunch of characters in that locale, and no ASCII Latin ones. For a
Latin locale, these ASCII letters could be intermixed with the UTF-8
ones, causing potential ambiguity.
Diffstat (limited to 'mg.c')
-rw-r--r-- | mg.c | 12 |
1 files changed, 9 insertions, 3 deletions
@@ -818,9 +818,9 @@ S_fixup_errno_string(pTHX_ SV* sv)
      * avoid as many possible backward compatibility issues as possible, we
      * don't turn on the flag unless we have to.  So the flag stays off for
      * an entirely invariant string.  We assume that if the string looks
-     * like UTF-8, it really is UTF-8:  "text in any other encoding that
-     * uses bytes with the high bit set is extremely unlikely to pass a
-     * UTF-8 validity test"
+     * like UTF-8 in a single script, it really is UTF-8:  "text in any
+     * other encoding that uses bytes with the high bit set is extremely
+     * unlikely to pass a UTF-8 validity test"
      * (http://en.wikipedia.org/wiki/Charset_detection).  There is a
      * potential that we will get it wrong however, especially on short
      * error message text, so do an additional check. */
@@ -831,6 +831,12 @@ S_fixup_errno_string(pTHX_ SV* sv)
 
         && _is_cur_LC_category_utf8(LC_MESSAGES)
 
+#else   /* If can't check directly, at least can see if script is consistent,
+           under UTF-8, which gives us an extra measure of confidence. */
+
+        && isSCRIPT_RUN((const U8 *) SvPVX_const(sv), (U8 *) SvEND(sv),
+                        TRUE, /* Means assume UTF-8 */
+                        NULL)
 #endif
 
     ) {
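
The fallback heuristic described in the commit message (well-formed UTF-8, plus all letters from one script) can be illustrated outside of Perl with a small standalone sketch. This is not the code behind `isSCRIPT_RUN`: the names `script_of`, `decode_utf8`, and `looks_like_single_script_utf8` are hypothetical, and the script table is a deliberately coarse assumption covering only Latin, Greek, and Cyrillic, with everything else lumped into one bucket. Perl's real check relies on full Unicode script data and has additional rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, very coarse script labels (assumption: a real checker
 * would use the full Unicode Script / Script_Extensions data). */
typedef enum { SC_COMMON, SC_LATIN, SC_GREEK, SC_CYRILLIC, SC_OTHER } script_t;

static script_t script_of(uint32_t cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A))
        return SC_LATIN;                    /* ASCII letters */
    if (cp < 0xC0)
        return SC_COMMON;                   /* digits, punctuation, symbols (approximation) */
    if (cp <= 0x024F && cp != 0xD7 && cp != 0xF7)
        return SC_LATIN;                    /* accented Latin letters */
    if (cp >= 0x0370 && cp <= 0x03FF)
        return SC_GREEK;                    /* approximate block match */
    if (cp >= 0x0400 && cp <= 0x04FF)
        return SC_CYRILLIC;
    return SC_OTHER;                        /* toy table: all other scripts lumped together */
}

/* Decode one UTF-8 sequence from s (n bytes available).
 * Returns the number of bytes consumed, or 0 if the sequence is malformed. */
static size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = s[0] & 0x07; }
    else return 0;                          /* stray continuation or invalid lead byte */

    if (n < len) return 0;                  /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond Unicode. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Returns 1 if buf is well-formed UTF-8, contains at least one non-ASCII
 * character, and all of its letters fall in a single script bucket.
 * Pure-ASCII input returns 0, mirroring "the flag stays off for an
 * entirely invariant string". */
static int looks_like_single_script_utf8(const char *buf, size_t n)
{
    const unsigned char *s = (const unsigned char *) buf;
    script_t seen = SC_COMMON;
    int has_non_ascii = 0;
    size_t i = 0;

    assert(buf != NULL || n == 0);
    while (i < n) {
        uint32_t cp;
        script_t sc;
        size_t used = decode_utf8(s + i, n - i, &cp);

        if (used == 0) return 0;            /* not valid UTF-8 at all */
        if (cp >= 0x80) has_non_ascii = 1;
        sc = script_of(cp);
        if (sc != SC_COMMON) {
            if (seen == SC_COMMON) seen = sc;   /* first letter fixes the script */
            else if (seen != sc) return 0;      /* mixed scripts: likely not UTF-8 */
        }
        i += used;
    }
    return has_non_ascii;
}
```

With this sketch, a Cyrillic UTF-8 message such as `"\xD0\x9D\xD0\xB5\xD1\x82"` is accepted, the same bytes followed by ASCII letters are rejected as mixed-script, and a Latin-1 string such as `"caf\xE9"` is rejected as malformed. An accented Latin message like `"f\xC3\xBC" "r"` is accepted even though ASCII letters are intermixed, which is the ambiguity for Latin locales that the commit message points out.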