| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
| |
This commit creates macros whose names mean something to me, and which I
don't find confusing. The older names are retained for backwards
compatibility. Future commits will fix bugs I introduced from
misunderstanding the meaning of the older names.
The older names are now #defined in terms of the newer ones, and moved
so that they are only defined once, valid for both ASCII and EBCDIC
platforms.
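As a rough illustration of the aliasing approach described above, the sketch below keeps an older macro name alive by defining it once in terms of a newer one, outside the platform-specific branch. All names here (EXAMPLE_NATIVE_TO_UNI, EXAMPLE_OLD_NAME, example_native_to_uni_table) are hypothetical and are not the macros this commit touched.

```c
/* Hypothetical names for illustration only; not the real Perl macros. */

#ifdef EBCDIC
/* On EBCDIC the newer macro would consult a translation table
 * (assumed to be defined elsewhere). */
extern const unsigned char example_native_to_uni_table[256];
#  define EXAMPLE_NATIVE_TO_UNI(ch) \
        (example_native_to_uni_table[(unsigned char)(ch)])
#else
/* On ASCII platforms native and Unicode ordinals coincide. */
#  define EXAMPLE_NATIVE_TO_UNI(ch)  ((unsigned char)(ch))
#endif

/* The older name is defined exactly once, after the platform split,
 * so it is valid on both kinds of platform and cannot drift. */
#define EXAMPLE_OLD_NAME(ch)  EXAMPLE_NATIVE_TO_UNI(ch)
```

Existing callers of the older name keep compiling, and a later fix to the newer macro reaches them automatically.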
|
|
|
|
|
| |
This patch was posted in
http://markmail.org/message/pwjxbxnlazvxgsyw
|
| |
|
| |
|
|
|
|
| |
This also reorders one #define to be closer to a related one.
|
| |
|
|
|
|
|
|
| |
It occurred to me that EBCDIC has different maximums for the number of
bytes a character can occupy. This moves the definition in utf8.h to
within an #ifndef EBCDIC, and adds the correct values to utfebcdic.h.
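A minimal sketch of the arrangement described above. The name EXAMPLE_MAXBYTES and its values are illustrative assumptions, not the constants this commit added: the values shown are what a 31-bit code point needs, given that a UTF-8 continuation byte carries 6 bits and a UTF-EBCDIC one carries 5.

```c
/* Illustrative only: not the values used in utf8.h or utfebcdic.h. */
#ifndef EBCDIC
/* UTF-8: 1 payload bit in the start byte + 5 continuation bytes * 6 bits. */
#  define EXAMPLE_MAXBYTES 6
#else
/* UTF-EBCDIC: 1 payload bit in the start byte + 6 continuation bytes * 5 bits. */
#  define EXAMPLE_MAXBYTES 7
#endif

/* A buffer sized for one encoded character plus a trailing NUL. */
typedef struct {
    unsigned char bytes[EXAMPLE_MAXBYTES + 1];
} example_char_buf;
```

Code that sizes buffers from the macro rather than a literal picks up the right maximum on either platform.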
|
| |
|
|
|
|
|
|
| |
A new regen'd header file has been created that contains the native
values for certain characters. By using those macros, we can eliminate
EBCDIC dependencies.
|
|
|
|
|
| |
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
|
|
|
|
|
|
| |
The previous definition allowed for (illegal) overlongs. The uses of
this macro in the core assume that it is accurate. The inaccuracy can
cause such code to fail.
|
|
|
|
|
|
|
|
|
| |
Previously, the macro changed by this commit would accept overlong
sequences.
This patch was changed by the committer to include EBCDIC changes and,
in the non-EBCDIC case, to save a test by using a mask instead, in
keeping with the prior version of the code.
|
|
|
|
|
|
|
| |
Sync copyright dates with actual changes according to git history.
[Plus run regen_perly.h to update the SHA-256 checksums, and
regen/regcharclass.pl to update regcharclass.h]
|
|
|
|
| |
I8 is a synonym for 'UTF' in this context, and is more meaningful to me.
|
|
|
|
|
| |
These will be used in a future commit; the ordinals are different on
EBCDIC vs. ASCII.
|
|
|
|
|
|
|
|
|
| |
# New Ticket Created by (Peter J. Acklam)
# Please include the string: [perl #81904]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
Signed-off-by: Abigail <abigail@abigail.be>
|
| |
| |
The PL_fold table map on EBCDIC only works on the ASCII-subrange
characters, not the full native Latin1.
To fix this, I moved the table to utfebcdic.h for EBCDIC platforms, and
actually changed it to three tables, one for each of the code pages
known to Perl.
There is no EBCDIC platform available to test on. What I did was hack
together a program from existing code that does EBCDIC transforms. I
ran it in ASCII mode, and verified that the generated table was
identical to the Latin1 table I had previously constructed by hand and
extensively tested. I then ran it on each of the three EBCDIC
transforms, and verified that each matched the places in the original
table that I knew were correct, all the ASCII alphabetics, the controls,
and a few other code points.
So these tables are at least as correct as the existing one, as they are
identical to it for [A-Z], [a-z].
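The verification step described above might look roughly like the following sketch. It assumes the fold table simply swaps the case of ASCII letters, and the helper names (build_ascii_fold, tables_agree_at) are invented for illustration; the real tables and the program used to generate them are not reproduced here.

```c
#include <stddef.h>

/* Fill a 256-entry table that swaps the case of ASCII letters and maps
 * every other byte to itself.  ASCII platforms only: the loop relies on
 * 'A'..'Z' being contiguous, which is not true of EBCDIC. */
static void build_ascii_fold(unsigned char fold[256])
{
    for (int i = 0; i < 256; i++)
        fold[i] = (unsigned char)i;
    for (int c = 'A'; c <= 'Z'; c++) {
        int lower = c - 'A' + 'a';
        fold[c] = (unsigned char)lower;
        fold[lower] = (unsigned char)c;
    }
}

/* Return 1 if the two tables agree at every listed position, else 0.
 * The positions are the entries already known to be correct. */
static int tables_agree_at(const unsigned char *generated,
                           const unsigned char *reference,
                           const unsigned char *positions, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        unsigned char p = positions[i];
        if (generated[p] != reference[p])
            return 0;
    }
    return 1;
}
```

Comparing only at trusted positions shows the generated table is no worse than the reference, which is the limited claim made above; it says nothing about the remaining entries.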
|
|
| |
Attached is a patch that removes from utfebcdic.h most definitions that
are common to it and utf8.h, and moves them to the common area of
utf8.h. The duplicate ones that are retained are each an integral part
of a larger related set that do differ between the headers.
Some of the definitions had started to drift, so this brings them back
into line, with a lowered possibility of future drift. In particular
the ones for the 'lazy' macros did not do quite as intended, especially
in the EBCDIC case. The bugs were a small performance hit only, in that
the macro was not quite as lazy as expected, and so loaded utf8_heavy.pl
possibly unnecessarily. In examining these, I noted that the utf8.h
definition of the start byte of a utf8 encoded string accepts invalid
start bytes 0xC0 and 0xC1. These are invalid because they are for
overlong encodings of ASCII code points. One is not supposed to allow
these, and there have been security attacks, according to Wikipedia,
against code that does. But I don't know all the ramifications for Perl
of changing to exclude these, so I left it alone, but added a comment
(and an item on my personal todo list to check into it).
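To make the overlong problem concrete, here is a small sketch for ASCII platforms. The function names are hypothetical, and the upper bound 0xF4 is the RFC 3629 limit rather than whatever range the Perl macro accepts; the point is only the difference at 0xC0 and 0xC1.

```c
#include <stdbool.h>

/* Lax test: accepts 0xC0 and 0xC1 as start bytes. */
static bool start_byte_lax(unsigned char c)
{
    return c >= 0xC0 && c <= 0xF4;
}

/* Strict test: rejects 0xC0 and 0xC1, which can only begin overlong
 * two-byte encodings of code points below 0x80. */
static bool start_byte_strict(unsigned char c)
{
    return c >= 0xC2 && c <= 0xF4;
}

/* Decode a two-byte sequence without any validity checking, to show
 * what an overlong sequence would yield. */
static unsigned decode_two_byte(unsigned char b0, unsigned char b1)
{
    return ((unsigned)(b0 & 0x1F) << 6) | (unsigned)(b1 & 0x3F);
}
```

With the lax test, the bytes 0xC1 0xBF decode to 0x7F, an ASCII code point that should only ever appear as a single byte. That is how an overlong sequence can slip a character past code that filters on the one-byte form.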
I made some comment clarifications, and removed some definitions marked
as obsolete in utf8.h that are in fact no longer used.
I added some synonyms for existing macros that more clearly reflect the
use that I intend to put them to in future patches.
From ba581aa4db767e5531ec0c0efdea5de4e9b09921 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Mon, 9 Nov 2009 08:38:24 -0700
Subject: [PATCH] Clean up utf headers
Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
|
| |
|
| |
|
| |
|
|
|
|
| |
(and run "make regen")
|
| |
Consider what currently happens when the tokenizer is scanning a string.
It looks through it byte-by-byte until it finds a character that forces
it to decide to go to utf8. It then calls sv_utf8_upgrade() with the
portion of the string scanned so far.
sv_utf8_upgrade() starts over from the beginning, and scans the string
byte-by-byte until it finds a character that varies between non-utf8 and
utf8. It then calls bytes_to_utf8().
bytes_to_utf8() allocates a new string that can handle the worst case
expansion, 2n+1, of the entire string, and starts over from the
beginning, and scans the input string byte-by-byte copying and
converting each character to the output string as it goes.
It doesn't return the size of the new string, so sv_utf8_upgrade()
assumes it is only as big as what actually got converted, throwing away
knowledge of any spare.
It then returns to the tokenizer, which immediately does a grow to get
space for the unparsed input. This is likely to cause a new string to
be allocated and copied from the one we had just created, even if that
string in actuality had enough space in it.
Thus, the invariant head portion of the string is scanned 3 times, and
probably 2 strings will be allocated and copied.
My solution to cutting this down is to do several things.
First, I added an extra flag for sv_utf8_upgrade that says don't bother
to check if the string needs to be converted to utf8, just assume it
does. This eliminates one of the passes.
I also added a new parameter to sv_utf8_upgrade that says when you
return, I want this much unused space in the string. That eliminates
the extra grow.
This was all done by renaming the current work-horse function from
sv_utf8_upgrade_flags to be sv_utf8_upgrade_flags_grow() and making the
current function name be a macro which calls the revised one with a 0
grow parameter.
I also improved the internal efficiency of sv_utf8_upgrade so that when
it does scan the string, it doesn't call bytes_to_utf8, but does the
conversion itself, using a fast memory copy instead of the byte-oriented
one for the invariant header, and it uses that header to get a better
estimate of the needed size of the new string, and it doesn't throw away
the knowledge of the allocated size.
And, if it is clear without scanning the whole string that the
conversion will fit in the already allocated string, it just uses that
instead of allocating and copying a new one, using the algorithm I
copied from the tokenizer. (In this case it does have to finish
scanning the whole string to get the correct size.) The comments have
details.
It still is byte-oriented. Vectorization et al. could yield
performance improvements. One idea for that is in the comments.
The patch also includes a new synonym I created which is a more accurate
name than NATIVE_TO_ASCII.
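The ideas above can be sketched outside of Perl as follows. This is not the sv_utf8_upgrade_flags_grow() code: it is a standalone Latin-1 to UTF-8 converter for ASCII platforms, with hypothetical names, that copies the invariant head with memcpy, sizes the buffer from the head length, reserves the spare space the caller asks for, and reports the allocated size instead of discarding it. Size overflow is not handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Convert len Latin-1 bytes to UTF-8, leaving `extra` unused bytes in
 * the result.  Returns a malloc'd, NUL-terminated buffer (or NULL). */
static unsigned char *
latin1_to_utf8_grow(const unsigned char *src, size_t len, size_t extra,
                    size_t *out_len, size_t *out_cap)
{
    size_t head = 0;
    assert(src != NULL && out_len != NULL && out_cap != NULL);

    /* Invariant head: bytes below 0x80 are the same in UTF-8. */
    while (head < len && src[head] < 0x80)
        head++;

    /* Head needs 1 byte each, the rest at most 2 each, plus NUL and
     * the spare requested by the caller. */
    size_t cap = head + 2 * (len - head) + 1 + extra;
    unsigned char *dst = malloc(cap);
    if (dst == NULL)
        return NULL;

    memcpy(dst, src, head);               /* fast copy of the head */
    unsigned char *d = dst + head;
    for (size_t i = head; i < len; i++) { /* byte-wise for the tail */
        unsigned char c = src[i];
        if (c < 0x80) {
            *d++ = c;
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));
            *d++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *d = '\0';
    *out_len = (size_t)(d - dst);
    *out_cap = cap;                       /* keep the spare visible */
    return dst;
}

/* The old entry point becomes a macro asking for zero spare bytes. */
#define latin1_to_utf8(src, len, out_len, out_cap) \
    latin1_to_utf8_grow((src), (len), 0, (out_len), (out_cap))
```

A caller that knows how much unparsed input is still to come can pass that count as `extra` and avoid a second allocation and copy.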
|
| |
|
|
|
| |
p4raw-id: //depot/perl@32793
|
|
|
| |
p4raw-id: //depot/perl@32237
|
|
|
|
|
|
| |
files that generate .h files, so they'll be ready
next time.
p4raw-id: //depot/perl@29695
|
|
|
| |
p4raw-id: //depot/perl@28447
|
|
|
|
|
| |
Message-Id: <20051127170016.A786.BQW10602@nifty.com>
p4raw-id: //depot/perl@26229
|
|
|
|
|
| |
Message-Id: <2f14220e7101a03f7659dbe79a03b115@petdance.com>
p4raw-id: //depot/perl@24074
|
|
|
|
|
|
| |
since that would break a lot of code.) Also a few
stray UTF16s, UTF32s, and "encoded in Unicode".
p4raw-id: //depot/perl@21198
|
|
|
|
|
|
|
| |
(Lots of Perl 5 source code archaeology was involved.)
Larry didn't make strangled noises when I showed him
the patch, either :-)
p4raw-id: //depot/perl@19242
|
|
|
| |
p4raw-id: //depot/perl@18807
|
|
|
| |
p4raw-id: //depot/perl@18801
|
|
|
| |
p4raw-id: //depot/perl@16888
|
|
|
| |
p4raw-id: //depot/perl@16857
|
|
|
|
|
|
| |
From: "Roca Carrio, Ignasi (PO EP)" <Ignasi.Roca@fujitsu-siemens.com>
Message-ID: <318B95F90D8BD41194A5009027FD5FFBCE6CED@madrid14.mad.fsc.net>
p4raw-id: //depot/perl@16855
|
|
|
|
|
| |
Message-Id: <13817376786.20020312002021@motor.ru>
p4raw-id: //depot/perl@15189
|
|
|
| |
p4raw-id: //depot/perl@14391
|
|
|
|
|
|
| |
patch: rename HINT_BYTE and IN_BYTE to HINT_BYTES and IN_BYTES
to match the pragma name; various robustness cleanups.
p4raw-id: //depot/perl@10339
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Lose the extra level of function on ASCII.
- spotted a chr(0) issue in sv.c
- re-work of UTF-X tr/// ranges to work in Unicode
space. Still issues with the "0xff is illegal UTF-8" hack.
- Yet another ad hoc utf8 'upgrade' in op.c recoded
(why do it once when you can do it all over the place :-( )
- Enable HINTS_UTF8 on EBCDIC - then ignore it in toke.c,
need utf8.pm for swashes.
- Simplified and commented scan_const() in toke.c
Still something wrong with regexp and tr (swashes?).
p4raw-id: //depot/perlio@9267
|
|
|
| |
p4raw-id: //depot/perlio@9246
|
|
|
|
|
| |
Builds and passes many tests on OS390.
p4raw-id: //depot/perlio@9190
|
|
encoding on EBCDIC platforms. This has the property that U+0000..U+009F,
i.e. a superset of ASCII, are invariant under the encoding. This is EBCDIC
friendly, as an encoded string can be looked at as being EBCDIC by the lexer,
sprintf("%d",...) etc., in the same manner that a UTF-8 string can be
considered ASCII on ASCII machines.
- re-arrange utf8.h to get ASCII specific vs Unicode generic bits
separate.
- Add some more macros to comprehend different shift amounts and
possible swizzle in UTF-EBCDIC vs UTF-8. Change utf8.c to use them.
- add utfebcdic.h which provides UTF-EBCDIC versions of the macros,
and conditionally #include it.
EBCDIC build as yet untested. ASCII still fails the one test.
p4raw-id: //depot/perlio@9185
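A rough sketch of why the shift amount has to be a parameter rather than a literal. It assumes, following my reading of Unicode Technical Report #16, that a UTF-EBCDIC (I8) continuation byte has the form 101xxxxx and carries 5 bits, where a UTF-8 continuation byte has the form 10xxxxxx and carries 6. The function name is hypothetical, and the final byte permutation (the "swizzle") from I8 to real EBCDIC bytes is not shown.

```c
/* Encode a code point that fits in two bytes.
 *   UTF-8: shift = 6, mark = 0x80, valid for U+0080..U+07FF
 *   I8:    shift = 5, mark = 0xA0, valid for U+00A0..U+03FF
 * The caller is responsible for passing a code point in range. */
static void encode_two_byte(unsigned cp, unsigned shift,
                            unsigned char mark, unsigned char out[2])
{
    unsigned mask = (1u << shift) - 1u;   /* low bits kept per byte */

    out[0] = (unsigned char)(0xC0 | (cp >> shift));  /* 110yyyyy */
    out[1] = (unsigned char)(mark | (cp & mask));    /* continuation */
}
```

For U+00E9 this gives C3 A9 with the UTF-8 parameters and C7 A9 with the I8 parameters, so code that hard-codes a shift of 6 cannot be shared between the two encodings.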
|