author | Karl Williamson <khw@cpan.org> | 2021-05-25 12:08:06 -0600 |
---|---|---|
committer | Nicholas Clark <nick@ccl4.org> | 2021-06-30 08:32:27 +0000 |
commit | 1615185ee59e90a57a3b865c537ba4895a8bd2fd (patch) | |
tree | e24328e64d07147be0084b33cd1e19f80f4c24a2 /dist | |
parent | fcbd4a2938c13706c5e26cd84141255c0ac9e202 (diff) | |
download | perl-1615185ee59e90a57a3b865c537ba4895a8bd2fd.tar.gz |
Dumper.xs: Output orphaned EBCDIC control as octal
This makes the code simpler, and removes the need to worry about and
comment on EBCDIC.
On ASCII machines there are the C0 controls, the C1 controls, and DEL,
which isn't technically in either set. The C0 and DEL controls are
treated as low ordinal, and output using octal notation. This commit
has no behavior changes on ASCII platforms.
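
The "no behavior changes on ASCII platforms" claim can be checked with a small standalone sketch. It assumes that on ASCII platforms both isASCII(k) and UTF8_IS_INVARIANT(k) reduce to k < 0x80; the helper names below are hypothetical stand-ins and are not the perl macros themselves.

```c
#include <assert.h>

/* Hypothetical stand-ins for the perl macros, valid only under the
 * assumption of an ASCII platform, where both reduce to k < 0x80. */
static int sketch_is_ascii(unsigned k)          { return k < 0x80; }
static int sketch_utf8_is_invariant(unsigned k) { return k < 0x80; }

/* Old test from the diff: non-ASCII and above SPACE. */
static int old_is_high_ordinal(unsigned k)
{
    return !sketch_is_ascii(k) && k > ' ';
}

/* New test from the diff: anything that is not UTF-8 invariant. */
static int new_is_high_ordinal(unsigned k)
{
    return !sketch_utf8_is_invariant(k);
}

/* Count the byte values on which the two tests disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    unsigned k;

    for (k = 0; k < 256; k++) {
        if (old_is_high_ordinal(k) != new_is_high_ordinal(k))
            mismatches++;
    }
    return mismatches;
}

/* Self-check: aborts if the two tests ever classify a byte differently. */
static void check_no_behavior_change(void)
{
    assert(count_mismatches() == 0);
}
```

Under that assumption the `k > ' '` clause is redundant, because every non-ASCII byte is already above SPACE, so the two conditions classify all 256 byte values identically.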
On EBCDIC machines, there are 1-1 mappings to the entire set of 65 ASCII
controls. All but one are in a single block and have been output using
octal. This commit doesn't change the behavior of the 64 single-block
controls.
There is a lone control that isn't adjacent to the others, orphaned.
This commit's only effect is to cause it to be displayed using octal
instead of hex. I believe the simplification of the code warrants this
change.
On extant EBCDIC platforms that Perl supports, this control is 0xFF,
named EO or EIGHT ONES, and is somewhat like DEL on ASCII platforms,
which we already display as octal, even though it is much higher ordinal
than any other control displayed as octal.
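
As a rough illustration of the visible difference for this one control, the sketch below formats a byte value both ways. The helper names are hypothetical, and the format strings are an assumption about the general shape of the output (a backslash plus octal digits versus \x{...}), not a copy of the Dumper.xs code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write c as a backslash-octal escape, e.g. \377. */
static int fmt_octal_escape(char *buf, size_t len, unsigned char c)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "\\%o", (unsigned) c);
}

/* Hypothetical helper: write c as a \x{...} hex escape, e.g. \x{ff}. */
static int fmt_hex_escape(char *buf, size_t len, unsigned char c)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "\\x{%x}", (unsigned) c);
}
```

With these helpers, 0xFF comes out as \377 in the octal form and \x{ff} in the hex form, and DEL (0x7F) comes out as \177 in the octal form.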
Diffstat (limited to 'dist')
-rw-r--r-- | dist/Data-Dumper/Dumper.xs | 31 |
1 files changed, 11 insertions, 20 deletions
diff --git a/dist/Data-Dumper/Dumper.xs b/dist/Data-Dumper/Dumper.xs
index b5d7e39fc6..0c1b98773e 100644
--- a/dist/Data-Dumper/Dumper.xs
+++ b/dist/Data-Dumper/Dumper.xs
@@ -254,13 +254,10 @@ esc_q_utf8(pTHX_ SV* sv, const char *src, STRLEN slen, I32 do_utf8, I32 useqq)
                     normal++;
                 }
             }
-            else if (! isASCII(k) && k > ' ') {
-                /* High ordinal non-printable code point. (The test that k is
-                 * above SPACE should be optimized out by the compiler on
-                 * non-EBCDIC platforms; otherwise we could put an #ifdef around
-                 * it, but it's better to have just a single code path when
-                 * possible. All but one of the non-ASCII EBCDIC controls are low
-                 * ordinal; that one is the only one above SPACE.)
+            else if (! UTF8_IS_INVARIANT(k)) {
+                /* We treat as low ordinal any code point whose representation is
+                 * the same under UTF-8 as not. Thus, this is a high ordinal code
+                 * point.
                  *
                  * If UTF-8, output as hex, regardless of useqq. This means there
                  * is an overhead of 4 chars '\x{}'. Then count the number of hex
@@ -329,18 +326,10 @@ esc_q_utf8(pTHX_ SV* sv, const char *src, STRLEN slen, I32 do_utf8, I32 useqq)
             U8 c0 = *(U8 *)s;
             UV k;
 
-            if (do_utf8
-                && ! isASCII(c0)
-                    /* Exclude non-ASCII low ordinal controls. This should be
-                     * optimized out by the compiler on ASCII platforms; if not
-                     * could wrap it in a #ifdef EBCDIC, but better to avoid
-                     * #if's if possible */
-                && c0 > ' '
-            ) {
-
-                /* When in UTF-8, we output all non-ascii chars as \x{}
-                 * reqardless of useqq, except for the low ordinal controls on
-                 * EBCDIC platforms */
+            if (do_utf8 && ! UTF8_IS_INVARIANT(c0)) {
+
+                /* In UTF-8, we output as \x{} all chars that require more than
+                 * a single byte in UTF-8 to represent. */
                 k = utf8_to_uvchr_buf((U8*)s, (U8*) send, NULL);
 
                 /* treat invalid utf8 byte by byte. This loop iteration gets the
@@ -602,7 +591,9 @@ dump_regexp(pTHX_ SV *retval, SV *val)
             k = *p;
         }
 
-        if ((k == '/' && !saw_backslash) || (do_utf8 && ! isASCII(k) && k > ' ')) {
+        if ((k == '/' && !saw_backslash) || ( do_utf8
+                                             && ! UTF8_IS_INVARIANT(k)))
+        {
             STRLEN to_copy = p - (U8 *) rval;
             if (to_copy) {
                 /* If saw_backslash is true, this will copy the \ for us too. */