path: root/utfebcdic.h
Commit message | Author | Age | Files | Lines
* utfebcdic.h: Add comments | Karl Williamson | 2014-05-31 | 1 | -0/+2
* Fix definition of toCTRL() for EBCDIC | Karl Williamson | 2014-05-31 | 1 | -0/+4
    The definition was incorrect. When going from control to printable name, we need to go from Latin1 -> Native, so that, e.g., a 65 gets turned into the native 'A'.
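The control-to-printable direction described in this commit can be sketched as follows. This is only an illustration, not Perl's actual toCTRL() macro: the names `ctrl_to_printable_name` and `make_toy_latin1_to_native` are hypothetical, and the toy table (identity except that 0x41 and 0xC1 are swapped, mimicking 'A' being 0xC1 in EBCDIC code pages) stands in for a real generated table.

```c
#include <stddef.h>

/* Hypothetical helper: fill a toy Latin1 -> native table. It is the identity
 * mapping except that 0x41 and 0xC1 are swapped, so Latin1 'A' (65) maps to
 * 0xC1, the code point of 'A' in EBCDIC code pages. A real table would come
 * from a generated header. */
void make_toy_latin1_to_native(unsigned char table[256])
{
    for (size_t i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    table[0x41] = 0xC1;
    table[0xC1] = 0x41;
}

/* Hypothetical sketch: XOR with 64 turns a control's ordinal into the Latin1
 * code point of its printable name (1 -> 65); the result is then mapped
 * Latin1 -> native so that 65 becomes the native 'A'. */
unsigned char ctrl_to_printable_name(unsigned char ctrl,
                                     const unsigned char latin1_to_native[256])
{
    return latin1_to_native[(unsigned char)(ctrl ^ 64)];
}
```

With the toy table, control 1 yields 0xC1 rather than 65, which is the behaviour the commit message asks for.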
* Make many EBCDIC tables generated instead of hand-coded | Karl Williamson | 2014-05-31 | 1 | -581/+6
    This causes the generated file ebcdic_tables.h to be #included by utfebcdic.h instead of the hand-coded tables that were formerly there. This makes it much easier to add or remove support for EBCDIC code pages.

    The UTF-EBCDIC-related tables for 037 and POSIX-BC are somewhat modified from what they were before. They were changed by hand minimally a long time ago to prevent segfaults, but in so doing, they lost an important sorting characteristic of UTF-EBCDIC. The machine-generated versions retain the sorting while also avoiding the segfaults. utfebcdic.h has more detail about this, regarding tr16.
* utfebcdic.h: Comment changes only | Karl Williamson | 2014-05-30 | 1 | -26/+45
    Clarifications and typo fix.
* utf8.h, utfebcdic.h: Add #define | Karl Williamson | 2013-08-29 | 1 | -0/+2
* utfebcdic.h: Change 'unsigned char' to U8 | Karl Williamson | 2013-08-29 | 1 | -35/+35
    This is for consistency with the rest of Perl.
* utfebcdic.h: Add (UV) cast | Karl Williamson | 2013-08-29 | 1 | -1/+1
    The operand of this macro is implicitly a UV. Make sure that it is.
* utfebcdic.h: Add comment | Karl Williamson | 2013-08-29 | 1 | -0/+6
* utf8.h: Clean up and use START_MARK definition | Karl Williamson | 2013-08-29 | 1 | -1/+3
    The previous definition broke good encapsulation rules. UTF_START_MARK should return something that fits in a byte; it shouldn't be the caller that does this. So the mask is moved into the definition. This means it can apply only to the portion that creates something larger than a byte. Further, the EBCDIC version can be simplified, since 7 is the largest possible number of bytes in an EBCDIC UTF8 character.
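The encapsulation point can be illustrated with a small sketch. The function names below are hypothetical and the bodies are an assumption about the general shape of such a start mark (a lead byte with `len` leading one-bits), not a copy of the macro in utf8.h or utfebcdic.h.

```c
/* Hypothetical sketch of a start-mark helper: for a sequence of `len` bytes
 * (len >= 2), the lead byte begins with `len` one-bits followed by a zero.
 * The 0xFF mask lives inside the function, so the result always fits in a
 * byte and callers never need to mask it themselves. */
unsigned char utf_start_mark(unsigned len)
{
    if (len > 7)            /* the lead byte is all one-bits */
        return 0xFF;
    return (unsigned char)(0xFF & (0xFE << (7 - len)));
}

/* Simplified variant that assumes len never exceeds 7, which the commit
 * message says holds for EBCDIC; the range test can then be dropped. */
unsigned char utf_start_mark_max7(unsigned len)
{
    return (unsigned char)(0xFF & (0xFE << (7 - len)));
}
```

For example, a length of 2 gives 0xC0 and a length of 3 gives 0xE0, the familiar two- and three-byte lead patterns.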
* utfebcdic.h: Remove extra parameter expansions | John Goodyear | 2013-08-29 | 1 | -2/+2
    These two macros were improperly expanding the parameters as well as defining the operation, leading to compile errors.
* Add macro OFFUNISKIP | Karl Williamson | 2013-08-29 | 1 | -1/+2
    This means use official Unicode code point numbering, not native. Doing this converts the existing UNISKIP calls in the code to refer to native code points, which is what they meant anyway. The terminology is somewhat ambiguous, but I don't think it will cause real confusion. NATIVE_SKIP is also introduced for situations where it is important to be precise.
* Make casing tables native | Karl Williamson | 2013-08-29 | 1 | -10/+162
    These are final tables that haven't been converted to native character set casing.
* utfebcdic.h: Remove trailing spaces | Karl Williamson | 2013-08-29 | 1 | -4/+4
* Deprecate NATIVE_TO_NEED and ASCII_TO_NEED | Karl Williamson | 2013-08-29 | 1 | -8/+0
    These macros are no longer called in the Perl core. This commit turns them into functions so that they can use gcc's deprecation facility.

    I believe these were defective right from the beginning, and I have struggled to understand what's going on. From the name, it appears NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the appropriate parameter indicates that. But that is impossible to do correctly from that API, as for variant characters, it needs to return two bytes. It could only work correctly if ch is an I8 byte, which isn't native, and hence the name would be wrong.

    Similar arguments apply to ASCII_TO_NEED.

    The function S_append_utf8_from_native_byte(const U8 byte, U8** dest) does what I think NATIVE_TO_NEED intended.
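Why a single-byte return value cannot work, and why an appending interface can, is perhaps easiest to see in code. The sketch below is an assumption-laden illustration, not the body of S_append_utf8_from_native_byte: it uses the hypothetical name `append_utf8_from_byte`, plain `unsigned char` instead of U8, and assumes an ASCII platform where the native byte is already Latin1.

```c
/* Hypothetical sketch (ASCII platform assumed): append the UTF-8 encoding of
 * one Latin1 byte at *dest and advance *dest past what was written.
 * Invariant bytes (below 0x80) take one byte; variant bytes take two, which
 * is why an interface that returns a single byte cannot be correct. */
void append_utf8_from_byte(unsigned char byte, unsigned char **dest)
{
    if (byte < 0x80) {
        *(*dest)++ = byte;
    } else {
        *(*dest)++ = (unsigned char)(0xC0 | (byte >> 6));   /* lead byte */
        *(*dest)++ = (unsigned char)(0x80 | (byte & 0x3F)); /* continuation */
    }
}
```

The caller is responsible for making sure at least two bytes of space remain at `*dest`.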
* Use new clearer named #defines | Karl Williamson | 2013-08-29 | 1 | -10/+17
    This converts several areas of code to use the more clearly named macros introduced in the previous commit.
* utf8.h, utfebcdic.h: Create less confusing #defines | Karl Williamson | 2013-08-29 | 1 | -8/+6
    This commit creates macros whose names mean something to me, and which I don't find confusing. The older names are retained for backwards compatibility. Future commits will fix bugs I introduced from misunderstanding the meaning of the older names.

    The older names are now #defined in terms of the newer ones, and moved so that they are only defined once, valid for both ASCII and EBCDIC platforms.
* Fix redeclaration compiler errors on EBCDIC | Nicholas Clark | 2013-02-25 | 1 | -4/+5
    This patch was posted in http://markmail.org/message/pwjxbxnlazvxgsyw
* Add, fix comments | Karl Williamson | 2013-02-25 | 1 | -6/+6
* utf8.h, utfebcdic.h: Add, fix comments | Karl Williamson | 2013-02-15 | 1 | -2/+0
* utf8.h: Add comments | Karl Williamson | 2013-01-16 | 1 | -1/+3
    This also reorders one #define to be closer to a related one.
* utfebcdic.h: white space only | Karl Williamson | 2012-10-18 | 1 | -1/+1
* utf8.h: Correct some values for EBCDIC | Karl Williamson | 2012-10-14 | 1 | -0/+13
    It occurred to me that EBCDIC has different maximums for the number of bytes a character can occupy. This moves the definition in utf8.h to within an #ifndef EBCDIC, and adds the correct values to utfebcdic.h.
* utf8.h: Add macro to test if UTF8 code point isn't Latin1 | Karl Williamson | 2012-09-16 | 1 | -0/+1
* Remove some EBCDIC dependencies | Karl Williamson | 2012-09-13 | 1 | -19/+0
    A new regen'd header file has been created that contains the native values for certain characters. By using those macros, we can eliminate EBCDIC dependencies.
* update the editor hints for spaces, not tabs | Ricardo Signes | 2012-05-29 | 1 | -2/+2
    This updates the editor hints in our files for Emacs and vim to request that tabs be inserted as spaces.
* utf8.h: Use correct definition of start byte | Karl Williamson | 2012-04-26 | 1 | -1/+2
    The previous definition allowed for (illegal) overlongs. The uses of this macro in the core assume that it is accurate. The inaccuracy can cause such code to fail.
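As a rough illustration of what "overlong" means here, consider ASCII-platform UTF-8, where lead bytes 0xC0 and 0xC1 can only introduce two-byte encodings of code points below 0x80. The function names below are hypothetical, the checks ignore any upper bound on the lead byte, and the EBCDIC values differ; this is a sketch of the idea rather than the macro the commit changed.

```c
/* Decode a two-byte sequence with no validity checks at all. */
unsigned decode_two_bytes(unsigned char lead, unsigned char cont)
{
    return ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
}

/* Loose test: accepts 0xC0 and 0xC1, which only ever start overlongs. */
int is_start_loose(unsigned char b)
{
    return b >= 0xC0;
}

/* Strict test: 0xC2 is the first lead byte whose two-byte sequences decode
 * to something that actually needs two bytes (0x80 and above). */
int is_start_strict(unsigned char b)
{
    return b >= 0xC2;
}
```

The sequence 0xC1 0xBF decodes to 0x7F, which fits in one byte, so a lead byte of 0xC1 can never begin a shortest-form encoding.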
* utf8.h: Use correct UTF-8 downgradeable definition | Christian Hansen | 2012-04-26 | 1 | -1/+1
    Previously, the macro changed by this commit would accept overlong sequences.

    This patch was changed by the committer to include EBCDIC changes; and in the non-EBCDIC case, to save a test, by using a mask instead, in keeping with the prior version of the code.
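The "mask instead of a test" remark can be sketched for the ASCII case. On such platforms the only lead bytes whose two-byte sequences decode to 0x80..0xFF (and so can be downgraded to a single Latin1 byte) are 0xC2 and 0xC3. Both function names are hypothetical and the exact macro text in utf8.h is not reproduced here.

```c
/* Straightforward form: two comparisons. */
int is_downgradeable_two_tests(unsigned char b)
{
    return b == 0xC2 || b == 0xC3;
}

/* Masked form: 0xC2 and 0xC3 differ only in the lowest bit, so clearing it
 * and comparing once gives the same answer with a single test. */
int is_downgradeable_masked(unsigned char b)
{
    return (b & 0xFE) == 0xC2;
}
```

Both forms reject 0xC0 and 0xC1, so neither accepts the overlong sequences mentioned in the commit message.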
* Bump several file copyright dates | Steffen Schwigon | 2012-01-19 | 1 | -2/+2
    Sync copyright dates with actual changes according to git history.

    [Plus run regen_perly.h to update the SHA-256 checksums, and regen/regcharclass.pl to update regcharclass.h]
* utfebcdic.h: Add synonymous macros | Karl Williamson | 2011-11-21 | 1 | -0/+2
    I8 is a synonym for 'UTF' in this context, and is more meaningful to me.
* Add #defines for 2 Latin1 chars | Karl Williamson | 2011-02-27 | 1 | -0/+6
    These will be used in a future commit; the ordinals are different on EBCDIC vs. ASCII.
* Fix typos (spelling errors) in Perl sources. | Peter J. Acklam (via RT) | 2011-01-07 | 1 | -3/+3
    # New Ticket Created by (Peter J. Acklam)
    # Please include the string: [perl #81904]
    # in the subject line of all future correspondence about this issue.
    # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >

    Signed-off-by: Abigail <abigail@abigail.be>
* utfebcdic.h: comment additions, fix typo | Karl Williamson | 2010-11-22 | 1 | -4/+4
* PL_fold wrong for EBCDIC platforms. | Karl Williamson | 2010-11-22 | 1 | -0/+113
    The PL_fold table map on EBCDIC only works on the ASCII-subrange characters, not the full native Latin1.

    To fix this, I moved the table to utfebcdic.h for EBCDIC platforms, and actually changed it to three tables, one for each of the code pages known to Perl.

    There is no EBCDIC platform available to test on. What I did was hack together a program from existing code that does EBCDIC transforms. I ran it in ASCII mode, and verified that the generated table was identical to the Latin1 table I had previously constructed by hand and extensively tested.

    I then ran it on each of the three EBCDIC transforms, and verified that each matched the places in the original table that I knew were correct, all the ASCII alphabetics, the controls, and a few other code points. So these tables are at least as correct as the existing one, as they are identical to it for [A-Z], [a-z].
* More cleanup of utfebcdic.h and utf8.h | karl williamson | 2009-11-09 | 1 | -53/+30
    Attached is a patch that removes from utfebcdic.h most definitions that are common to it and utf8.h, and moves them to the common area of utf8.h. The duplicate ones that are retained are each an integral part of a larger related set that do differ between the headers.

    Some of the definitions had started to drift, so this brings them back into line, with a lowered possibility of future drift. In particular the ones for the 'lazy' macros did not do quite as intended, especially in the EBCDIC case. The bugs were a small performance hit only, in that the macro was not quite as lazy as expected, and so loaded utf8_heavy.pl possibly unnecessarily.

    In examining these, I noted that the utf8.h definition of the start byte of a utf8 encoded string accepts invalid start bytes 0xC0 and 0xC1. These are invalid because they are for overlong encodings of ASCII code points. One is not supposed to allow these, and there have been security attacks, according to Wikipedia, against code that does. But I don't know all the ramifications for Perl of changing to exclude these, so I left it alone, but added a comment (and an item on my personal todo list to check into it).

    I made some comment clarifications, and removed some definitions marked as obsolete in utf8.h that are in fact no longer used. I added some synonyms for existing macros that more clearly reflect the use that I intend to put them to in future patches.

    From ba581aa4db767e5531ec0c0efdea5de4e9b09921 Mon Sep 17 00:00:00 2001
    From: Karl Williamson <khw@khw-desktop.(none)>
    Date: Mon, 9 Nov 2009 08:38:24 -0700
    Subject: [PATCH] Clean up utf headers

    Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
* Define specially handled chars; and clean-up ebcdic vs unicode | Karl Williamson | 2009-11-08 | 1 | -1/+14
* Put parentheses around macro arguments | Rafael Garcia-Suarez | 2009-02-01 | 1 | -2/+2
* Cast result to character size before array indexed | Karl | 2009-02-01 | 1 | -8/+11
* Bump coopyright year in embed.pl and various files that were just touched | Rafael Garcia-Suarez | 2009-01-02 | 1 | -2/+2
    (and run "make regen")
* Faster sv_utf8_upgrade() | karl williamson | 2009-01-02 | 1 | -2/+3
    Consider what currently happens when the tokenizer is scanning a string. It looks through it byte-by-byte until it finds a character that forces it to decide to go to utf8. It then calls sv_utf8_upgrade() with the portion of the string scanned so far.

    sv_utf8_upgrade() starts over from the beginning, and scans the string byte-by-byte until it finds a character that varies between non-utf8 and utf8. It then calls bytes_to_utf8().

    bytes_to_utf8() allocates a new string that can handle the worst case expansion, 2n+1, of the entire string, and starts over from the beginning, and scans the input string byte-by-byte copying and converting each character to the output string as it goes. It doesn't return the size of the new string, so sv_utf8_upgrade() assumes it is only as big as what actually got converted, throwing away knowledge of any spare.

    It then returns to the tokenizer, which immediately does a grow to get space for the unparsed input. This is likely to cause a new string to be allocated and copied from the one we had just created, even if that string in actuality had enough space in it.

    Thus, the invariant head portion of the string is scanned 3 times, and probably 2 strings will be allocated and copied.

    My solution to cutting this down is to do several things. First, I added an extra flag for sv_utf8_upgrade that says don't bother to check if the string needs to be converted to utf8, just assume it does. This eliminates one of the passes. I also added a new parameter to sv_utf8_upgrade that says when you return, I want this much unused space in the string. That eliminates the extra grow. This was all done by renaming the current work-horse function from sv_utf8_upgrade_flags to be sv_utf8_upgrade_flags_grow() and making the current function name be a macro which calls the revised one with a 0 grow parameter.

    I also improved the internal efficiency of sv_utf8_upgrade so that when it does scan the string, it doesn't call bytes_to_utf8, but does the conversion itself, using a fast memory copy instead of the byte-oriented one for the invariant header, and it uses that header to get a better estimate of the needed size of the new string, and it doesn't throw away the knowledge of the allocated size. And, if it is clear without scanning the whole string that the conversion will fit in the already allocated string, it just uses that instead of allocating and copying a new one, using the algorithm I copied from the tokenizer. (In this case it does have to finish scanning the whole string to get the correct size.) The comments have details.

    It still is byte-oriented. Vectorization et al. could yield performance improvements. One idea for that is in the comments.

    The patch also includes a new synonym I created which is a more accurate name than NATIVE_TO_ASCII.
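The combination of ideas in this commit message (bulk-copy the invariant head, size the buffer from the remaining tail, reserve caller-requested spare space, and report the allocated size) can be sketched outside of Perl's SV machinery. Everything below is an assumption for illustration: the name `latin1_to_utf8_grow`, its parameters, and the use of malloc are hypothetical, and an ASCII platform with Latin1 input is assumed. It does not reproduce sv_utf8_upgrade_flags_grow().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: convert `len` Latin1 bytes to UTF-8, leaving `extra`
 * bytes of spare room. Returns a malloc'd, NUL-terminated buffer (or NULL),
 * storing the converted length in *out_len and the allocation in *out_alloc. */
unsigned char *latin1_to_utf8_grow(const unsigned char *src, size_t len,
                                   size_t extra, size_t *out_len,
                                   size_t *out_alloc)
{
    /* Find the invariant head: leading bytes that are the same in UTF-8. */
    size_t head = 0;
    while (head < len && src[head] < 0x80)
        head++;

    /* Only the tail can expand, at worst to two bytes per input byte, so the
     * estimate is tighter than 2*len + 1 for the whole string. */
    size_t alloc = head + 2 * (len - head) + 1 + extra;
    unsigned char *dst = malloc(alloc);
    if (dst == NULL)
        return NULL;

    memcpy(dst, src, head);            /* fast copy of the invariant head */
    unsigned char *d = dst + head;
    for (size_t i = head; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            *d++ = c;
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));
            *d++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *d = '\0';

    *out_len = (size_t)(d - dst);
    *out_alloc = alloc;                /* keep knowledge of the spare space */
    return dst;
}
```

A caller that knows it will append more data can pass that amount as `extra` and avoid a second allocation, which is the effect the grow parameter is described as having.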
* Update comments and documentation dealing with utf | Karl | 2008-12-26 | 1 | -1/+63
* Add editor blocks to some header files. | Marcus Holland-Moritz | 2008-01-01 | 1 | -1/+9
    p4raw-id: //depot/perl@32793
* Fix up copyright years for files modified in 2007. | Nicholas Clark | 2007-11-07 | 1 | -1/+1
    p4raw-id: //depot/perl@32237
* Update copyright years in .h files. Also, in .pl | Rafael Garcia-Suarez | 2007-01-05 | 1 | -1/+2
    files that generate .h files, so they'll be ready next time.
    p4raw-id: //depot/perl@29695
* Some CPP macro sanitization by Sadahiro Tomoyuki | Rafael Garcia-Suarez | 2006-06-29 | 1 | -1/+1
    p4raw-id: //depot/perl@28447
* Re: XS-assisted SWASHGET (esp. for t/uni/class.t speedup) | SADAHIRO Tomoyuki | 2005-11-30 | 1 | -2/+2
    Message-Id: <20051127170016.A786.BQW10602@nifty.com>
    p4raw-id: //depot/perl@26229
* Third consting batch | Andy Lester | 2005-03-24 | 1 | -3/+3
    Message-Id: <2f14220e7101a03f7659dbe79a03b115@petdance.com>
    p4raw-id: //depot/perl@24074
* It's UTF-8, not UTF8. (Note: not s/UTF-8/UTF8/, | Jarkko Hietaniemi | 2003-09-12 | 1 | -1/+1
    since that would break a lot of code.) Also few stray UTF16s, UTF32s, and "encoded in Unicode".
    p4raw-id: //depot/perl@21198
* Fix up Larry's copyright statements to my best knowledge. | Jarkko Hietaniemi | 2003-04-16 | 1 | -1/+1
    (Lots of Perl 5 source code archaeology was involved.) Larry didn't make strangled noises when I showed him the patch, either :-)
    p4raw-id: //depot/perl@19242
* Reverse copyright update (#18801) for files not changed in 2003. | Hugo van der Sanden | 2003-03-02 | 1 | -1/+1
    p4raw-id: //depot/perl@18807
* Update all copyrights to 2003, from Jarkko | Hugo van der Sanden | 2003-03-02 | 1 | -1/+1
    p4raw-id: //depot/perl@18801