* Add macro OFFUNISKIP (Karl Williamson, 2013-08-29; 4 files, -7/+18)
  This means use official Unicode code point numbering, not native.
  Doing this converts the existing UNISKIP calls in the code to refer
  to native code points, which is what they meant anyway. The
  terminology is somewhat ambiguous, but I don't think it will cause
  real confusion. NATIVE_SKIP is also introduced for situations where
  it is important to be precise.
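The SKIP family of macros computes how many bytes a code point occupies when encoded in UTF-8. A minimal sketch of the idea, assuming an ASCII platform and a hypothetical function name (the real macros also distinguish native from official Unicode numbering, which is the point of this commit):

```c
#include <assert.h>

/* Hypothetical sketch of what a SKIP-style macro computes: the number
 * of bytes in the UTF-8 representation of a code point (ASCII
 * platform, code points up to U+10FFFF). */
static int skip_for_cp(unsigned cp)
{
    if (cp < 0x80)    return 1;   /* invariant: one byte */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

On an EBCDIC platform the native and Unicode numberings differ for some characters, which is why having both an OFFUNISKIP and a NATIVE_SKIP spelling is useful.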
* toke.c: white space only (Karl Williamson, 2013-08-29; 1 file, -1/+3)
* toke.c: Don't remap \N{} for EBCDIC (Karl Williamson, 2013-08-29; 1 file, -14/+13)
  Everything but \N{U+XXXX} is now in the native character set.
* utf8.c: Stop using two functions (Karl Williamson, 2013-08-29; 6 files, -23/+85)
  This is in preparation for deprecating these functions, to force any
  code that has been using them to change. Since the Unicode tables
  are now stored in native order, these functions should only rarely
  be needed. However, their functionality is still needed, and in
  actuality, on ASCII platforms, the native functions are #defined to
  these. So what this commit does is rename the functions to something
  else, and create wrappers with the old names, so that anyone using
  them will get the deprecation when it actually goes into effect: we
  are waiting for CPAN files distributed with the core to change
  before doing the deprecation. According to grep.cpan.me, this should
  affect fewer than 10 additional CPAN distributions.
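The rename-and-wrap approach described above can be sketched as follows, with hypothetical names and a stand-in body (the real functions convert between UTF-8 and code points):

```c
#include <assert.h>

/* Hypothetical illustration of renaming a function while keeping the
 * old name as a thin wrapper.  The real work moves under the new
 * name; the wrapper is what can later carry a gcc deprecation
 * attribute without touching the implementation. */
static unsigned new_name(unsigned uv)
{
    return uv + 1;   /* stand-in body, not the real conversion */
}

/* Old entry point, kept so existing callers still compile; a
 * deprecation warning can be attached here later. */
static unsigned old_name(unsigned uv)
{
    return new_name(uv);
}
```

This keeps old callers compiling unchanged while giving a single place to hang the eventual deprecation.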
* Convert uvuni_to_utf8() to function (Karl Williamson, 2013-08-29; 5 files, -11/+9)
  Code should almost never be dealing with non-native code points.
  This is in preparation for later deprecation, when our CPAN modules
  have been converted away from using it.
* Deprecate utf8_to_uni_buf() (Karl Williamson, 2013-08-29; 3 files, -9/+10)
  Now that the tables are stored in native order, there is almost no
  need for code to be dealing in Unicode order. According to
  grep.cpan.me, there are no uses of this function in CPAN.
* Deprecate valid_utf8_to_uvuni() (Karl Williamson, 2013-08-29; 3 files, -3/+5)
  Now that all the tables are stored in native format, there is very
  little reason to use this function; and those who do need this kind
  of functionality should be using the bottom level routine, so as to
  make it clear they are doing nonstandard stuff. According to
  grep.cpan.me, there are no uses of this function in CPAN.
* utf8.c: Swap which fcn wraps the other (Karl Williamson, 2013-08-29; 5 files, -41/+32)
  This is in preparation for the current wrappee becoming deprecated.
* utf8.c: Skip a no-op (Karl Williamson, 2013-08-29; 1 file, -1/+1)
  Since the value is invariant under both UTF-8 and not, we already
  have it in 'uv'; no need to do anything else to get it.
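An "invariant" here is a character whose representation is the same byte whether or not the string is UTF-8 encoded, which is why no conversion step is needed. A tiny sketch with a hypothetical helper name, assuming an ASCII platform:

```c
#include <assert.h>

/* Hypothetical helper: on ASCII platforms a code point below 0x80 is
 * represented by the same single byte in UTF-8 and non-UTF-8 strings,
 * so "converting" such a value is a no-op. */
static int is_utf8_invariant_byte(unsigned char b)
{
    return b < 0x80;
}
```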
* utf8.c: Move comment to where it makes more sense (Karl Williamson, 2013-08-29; 1 file, -3/+3)
* APItest: Test native code points, instead of Unicode (Karl Williamson, 2013-08-29; 2 files, -13/+13)
* Don't refer to U+XXXX when we mean native (Karl Williamson, 2013-08-29; 2 files, -3/+3)
  These messages say the output number is Unicode, but it is really
  native, so change them to say 0xXXXX instead.
* Convert some uvuni() to uvchr() (Karl Williamson, 2013-08-29; 8 files, -49/+46)
  All the tables are now based on the native character set, so using
  uvuni() in almost all cases is wrong.
* handy.h: White space only (Karl Williamson, 2013-08-29; 1 file, -23/+23)
* t/test.pl: Allow native/latin1 string conversions to work on utf8 (Karl Williamson, 2013-08-29; 1 file, -102/+14)
  These functions no longer have the hard-coded definitions in them,
  but now end up resolving to internal functions, so that new
  encodings could be added and these would automatically understand
  them. Instead of using tr///, these now go character by character,
  converting to/from ord, which is slower, but allows them to operate
  on utf8 strings. Peephole optimization should make these essentially
  no-ops on ASCII platforms.
* t/test.pl: Simplify ord to/from native fcns (Karl Williamson, 2013-08-29; 1 file, -4/+4)
  This commit changes these functions from converting to/from a string
  to calling functions in the utf8:: namespace that operate on
  ordinals instead.
* Make casing tables native (Karl Williamson, 2013-08-29; 2 files, -13/+165)
  These are the final casing tables that hadn't yet been converted to
  the native character set.
* utfebcdic.h: Remove trailing spaces (Karl Williamson, 2013-08-29; 1 file, -4/+4)
* EBCDIC has the Unicode bug too (Karl Williamson, 2013-08-29; 1 file, -28/+2)
  We have not had a working modern Perl on EBCDIC for some years. When
  I started out, comments and code led me to conclude erroneously that
  natively it supported semantics for all 256 characters 0-255. It
  turns out that I was wrong; it natively (at least on some platforms)
  has the same rules (essentially none) for the characters which don't
  correspond to ASCII ones as the rules for these on ASCII platforms.
  A previous commit for 5.18 changed the docs about this issue. This
  current commit forces ASCII rules on EBCDIC platforms (even should
  there be one that natively uses all 256). To get all 256, the same
  things, like 'use feature "unicode_strings"', must now be done.
* handy.h: Solve a failure-to-compile problem under EBCDIC (Karl Williamson, 2013-08-29; 1 file, -13/+22)
  handy.h is included in files that don't include perl.h, and hence
  not utf8.h. We therefore can't rely on the ASCII/EBCDIC conversion
  macros being available to us. The best way to cope is to use the
  native ctype functions. Most, but not all, of the macros in this
  commit currently resolve to those native ones, but a future commit
  will change that.
* handy.h: Simplify some macro definitions (Karl Williamson, 2013-08-29; 1 file, -6/+3)
  Now, only one of the macros relies on magic numbers (isPRINT),
  leading to clearer definitions.
* handy.h: Combine macros that are the same in ASCII and EBCDIC (Karl Williamson, 2013-08-29; 1 file, -8/+4)
  These 4 macros can have the same RHS for their ASCII and EBCDIC
  versions, so there is no need to duplicate their definitions. This
  also enables the EBCDIC versions not to have undefined expansions
  when compiling without perl.h.
* Deprecate NATIVE_TO_NEED and ASCII_TO_NEED (Karl Williamson, 2013-08-29; 6 files, -21/+27)
  These macros are no longer called in the Perl core. This commit
  turns them into functions so that they can use gcc's deprecation
  facility. I believe these were defective right from the beginning,
  and I have struggled to understand what's going on. From the name,
  it appears NATIVE_TO_NEED takes a native byte and turns it into
  UTF-8 if the appropriate parameter indicates that. But that is
  impossible to do correctly from that API, as for variant characters
  it needs to return two bytes. It could only work correctly if ch is
  an I8 byte, which isn't native, and hence the name would be wrong.
  Similar arguments apply for ASCII_TO_NEED. The function
  S_append_utf8_from_native_byte(const U8 byte, U8** dest) does what I
  think NATIVE_TO_NEED intended.
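The argument about variant characters can be made concrete: appending a native byte to UTF-8 output must be able to emit two bytes, so no macro that returns a single byte can be correct. Below is a simplified sketch for an ASCII platform, modeled on what S_append_utf8_from_native_byte() is described as doing (the _sketch name marks it as hypothetical, not the core's code):

```c
#include <assert.h>

typedef unsigned char U8;

/* Appends the UTF-8 form of a native byte to *dest and advances the
 * pointer.  Invariant bytes (< 0x80) copy through unchanged; variant
 * bytes expand to a two-byte sequence, which is exactly why an
 * interface returning one byte cannot do this job. */
static void append_utf8_from_native_byte_sketch(const U8 byte, U8 **dest)
{
    if (byte < 0x80) {
        *(*dest)++ = byte;                       /* invariant: 1 byte */
    }
    else {
        *(*dest)++ = (U8)(0xC0 | (byte >> 6));   /* variant: 2 bytes */
        *(*dest)++ = (U8)(0x80 | (byte & 0x3F));
    }
}
```

For example, native 0xFF (U+00FF on ASCII platforms) becomes the two bytes C3 BF, while 'A' stays a single byte.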
* Remove remaining calls of NATIVE_TO_NEED (Karl Williamson, 2013-08-29; 1 file, -3/+3)
  These calls are just copying the input to the output byte by byte.
  There is no need to worry about UTF-8 or not, as the output is just
  an exact copy of the input.
* toke.c: Remove some NATIVE_TO_NEED calls (Karl Williamson, 2013-08-29; 1 file, -10/+10)
  I believe NATIVE_TO_NEED is defective, and will remove it in a
  future commit. But, just in case I'm wrong, I'm doing it in small
  steps so bisects will show the culprit. This removes the calls to it
  where the parameter is clearly invariant under UTF-8 and UTF-EBCDIC,
  and so the result can't be other than just the parameter.
* toke.c: Emphasize that only [A-Za-z] is used here (Karl Williamson, 2013-08-29; 1 file, -5/+5)
  This code is attempting to deal with the problem of holes in the
  ranges a-z and A-Z in EBCDIC. By using macros with the suffix "_A",
  we emphasize that.
* Use a real illegal UTF-8 byte (Karl Williamson, 2013-08-29; 3 files, -13/+15)
  The code here was wrong in assuming that \xFF is not legal in UTF-8
  encoded strings. It currently doesn't work due to a bug, but that
  may eventually be fixed: [perl #116867]. The comments are also wrong
  that all bytes are legal in UTF-EBCDIC. It turns out that in
  well-formed UTF-8, the bytes C0 and C1 never appear (C2, C3, and C4
  as well in UTF-EBCDIC), as they would be the start byte of an
  illegal overlong sequence. This commit creates a #define for an
  illegal byte using one of the real illegal ones, and changes the
  code to use that. No test is included due to #116867.
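The overlong claim can be checked arithmetically: the largest code point that a two-byte sequence led by C0 or C1 can encode is still below 0x80, so any such sequence re-encodes a value that already fits in one byte, which is what makes it an illegal overlong form. A small sketch (hypothetical helper, standard UTF-8 on an ASCII platform):

```c
#include <assert.h>

/* A two-byte UTF-8 sequence 110yyyyy 10xxxxxx encodes the 11-bit
 * value yyyyyxxxxxx.  This returns the largest code point a given
 * lead byte can encode, i.e. with a maximal continuation byte 0xBF. */
static unsigned max_cp_for_2byte_lead(unsigned char lead)
{
    return ((unsigned)(lead & 0x1F) << 6) | 0x3F;
}
```

Since the maxima for C0 and C1 are 0x3F and 0x7F, both below 0x80, neither byte can ever start a well-formed sequence; C2 (maximum 0xBF) is the first legal two-byte lead.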
* toke.c: Remove remapping for EBCDIC for octal (Karl Williamson, 2013-08-29; 1 file, -4/+3)
  The code prior to this commit converted something like \04 into its
  EBCDIC equivalent only in double-quoted strings. This was not done
  in patterns, and so gave inconsistent results. The correct thing is
  to do the native thing: what someone who works on a platform would
  expect \04 to do. Platform-independent characters are available
  through \N{}, either by name or by U+XXXX. The comment changed by
  this commit was wrong, as in some cases it was native, and in some
  cases Unicode.
* Remove EBCDIC remappings (Karl Williamson, 2013-08-29; 6 files, -92/+52)
  Now that the Unicode tables are stored in native format, we
  shouldn't be doing remapping. Note that this assumes that the Latin1
  casing tables are stored in native order; not all of this has been
  done yet.
* Add and use macro to return EBCDIC (Karl Williamson, 2013-08-29; 7 files, -40/+45)
  The conversion from UTF-8 to code point should generally be to the
  native code point. This adds a macro to do that, and converts the
  core calls to the existing macro to use the new one instead. The old
  macro is retained for possible backwards compatibility, though it
  probably should be deprecated.
* charnames: Make work in EBCDIC (Karl Williamson, 2013-08-29; 2 files, -12/+30)
  Now that mktables generates native tables, we need to make U+XXXX
  mean Unicode instead of native.
* Unicode::UCD: Work on non-ASCII platforms (Karl Williamson, 2013-08-29; 1 file, -33/+83)
  Now that mktables generates native tables, it is a fairly simple
  matter to get Unicode::UCD to work on those platforms.
* mktables: Generate native code-point tables (Karl Williamson, 2013-08-29; 1 file, -35/+161)
  The output tables for mktables are now in the platform's native
  character set. This means there is no change for ASCII platforms,
  but is a change for EBCDIC ones. Code that didn't realize there was
  a potential difference between EBCDIC and non-EBCDIC platforms will
  now start to work; code that tried to do the right thing under these
  circumstances will no longer work. Fixing that comes in later
  commits.
* mktables: Move table creation code (Karl Williamson, 2013-08-29; 1 file, -20/+12)
  This code is moved to later in the process. This is in preparation
  for mktables generating tables in the native character set. By
  moving it later, the translation to native has already been done,
  and special coding need not be done. This also caught 7 code points
  that were somehow omitted in the previous logic.
* Fix some EBCDIC problems (Karl Williamson, 2013-08-29; 3 files, -6/+5)
  These spots have native code points, so should be using the macros
  for native code points, instead of Unicode ones; this also changes
  to use the portable symbolic name of a code point instead of the
  non-portable hex.
* Remove unnecessary temp variable in converting to UTF-8 (Karl Williamson, 2013-08-29; 3 files, -16/+13)
  These areas of code included a temporary that is unnecessary.
* utf8.h: Correct macros for EBCDIC (Karl Williamson, 2013-08-29; 1 file, -5/+10)
  These macros were incorrect for EBCDIC. The 3-step process given in
  utfebcdic.h wasn't being followed.
* Extract common code to an inline function (Karl Williamson, 2013-08-29; 8 files, -44/+37)
  This fairly short paradigm is repeated in several places; a later
  commit will improve it.
* Don't use EBCDIC macro for a C language escape (Karl Williamson, 2013-08-29; 2 files, -3/+3)
  C recognizes '\a' (for BEL); just use that instead of a look-up.
  regen/unicode_constants.pl could be used to generate the character
  for ESC (set in nearby code), but I didn't do that because of
  potential bootstrapping problems when porting to an EBCDIC platform
  without a working perl. (The other characters generated in that .pl
  are less likely to cause problems when compiling perl.)
* Use byte-domain EBCDIC/LATIN1 macros where appropriate (Karl Williamson, 2013-08-29; 3 files, -31/+31)
  Macros like NATIVE_TO_UNI will work on EBCDIC, but operate on the
  whole Unicode range. In the locations affected by this commit, the
  domain is known to be limited to a single byte, so the simpler
  macros whose names contain LATIN1 may be used. On ASCII platforms,
  all these macros are null, so there is no effective change.
* Use new, more clearly named #defines (Karl Williamson, 2013-08-29; 5 files, -30/+37)
  This converts several areas of code to use the more clearly named
  macros introduced in the previous commit.
* utf8.h, utfebcdic.h: Create less confusing #defines (Karl Williamson, 2013-08-29; 2 files, -17/+31)
  This commit creates macros whose names mean something to me, and
  which I don't find confusing. The older names are retained for
  backwards compatibility. Future commits will fix bugs I introduced
  from misunderstanding the meaning of the older names. The older
  names are now #defined in terms of the newer ones, and moved so that
  they are only defined once, valid for both ASCII and EBCDIC
  platforms.
* pp_ctl.c: Use isCNTRL instead of hard-coded mask (Karl Williamson, 2013-08-29; 1 file, -13/+9)
  This is clearer and portable to EBCDIC. There is a subtle difference
  in the behavior here, which I believe is a bug fix. Before, the code
  didn't treat DEL as a control, and now it does.
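The behavior difference can be demonstrated against the standard ctype classification: a mask that only catches bytes below 0x20 misses DEL (0x7F), which iscntrl() counts as a control. The mask below is a hypothetical stand-in for the old hard-coded test, not the actual pp_ctl.c code:

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for a hard-coded "is this a control?" mask
 * that only looks at the low range, and therefore misses DEL. */
static int mask_is_cntrl(unsigned char c)
{
    return c < 0x20;
}
```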
* utf8.c: is_utf8_char_slow() should use native length (Karl Williamson, 2013-08-29; 1 file, -1/+1)
  What is passed in is the actual length of the native utf8 character.
  What this was calculating was the length it would be if it were a
  Unicode character, and then comparing the two: apples to oranges.
* Stop t/op/coreamp.t leaving temporary files behind (Steve Hay, 2013-08-29; 1 file, -0/+9)
  test.pl will delete any file made by tempfile(), but it won't delete
  the *.dir and *.pag files made by dbmopen. Hopefully this is the
  right way to delete them, cribbed from lib/AnyDBM_File.t.
* Stop t/io/utf8.t leaving a temporary file behind on Windows (Steve Hay, 2013-08-29; 1 file, -0/+1)
  Filehandles need to be closed for unlink to work on Windows.
* Prevent ExtUtils-CBuilder leaving test output on Windows (Steve Hay, 2013-08-29; 1 file, -16/+39)
  The link function calls in the have_compiler and have_cplusplus
  tests create a compilet.def file on Windows, which is correctly
  recorded for cleaning up when the EU::CB object is destroyed; but if
  another one gets made in the meantime then
  ExtUtils::Mksymlists::Mksymlists moves the first one to
  compilet_def.old, which isn't recorded for cleaning up and gets left
  behind when the test script has finished. Using a new object each
  time, destroying the previous one first, prevents this.
* Update the comments in t/op/magic.t (Nicholas Clark, 2013-08-29; 1 file, -1/+2)
  (This should have been part of the previous commit, but I forgot to
  --amend it.)
* ${^MPEN} had been treated as a synonym of ${^MATCH} due to a missing break; (Nicholas Clark, 2013-08-29; 2 files, -1/+9)
  A missing break; in Perl_gv_fetchpvn_flags() meant that the variable
  ${^MPEN} had been behaving as a synonym of ${^MATCH}. Adding the
  break makes ${^MPEN} behave like all the other unused
  multi-character control character variables.
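The bug class is ordinary C switch fall-through: with the break missing, control runs into the next case's code, so two distinct inputs behave identically. A minimal sketch (hypothetical lookup, not the actual Perl_gv_fetchpvn_flags() code):

```c
#include <assert.h>

/* With the break missing, 'M' falls through into the 'N' arm, so both
 * inputs produce the same result: the same shape of bug that made
 * ${^MPEN} act as a synonym of ${^MATCH}. */
static int classify_buggy(char c)
{
    int r = 0;
    switch (c) {
        case 'M': r = 1;          /* missing break: falls through */
        case 'N': r = 2; break;
        default:  break;
    }
    return r;
}

/* The one-line fix: add the break, and the two cases diverge again. */
static int classify_fixed(char c)
{
    int r = 0;
    switch (c) {
        case 'M': r = 1; break;   /* the added break */
        case 'N': r = 2; break;
        default:  break;
    }
    return r;
}
```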
* perldelta - Rewordings in the light of fa3234e35d (Steve Hay, 2013-08-29; 1 file, -4/+4)