path: root/utf8.h
Commit message  Author  Age  Files  Lines
* utf8.h, utfebcdic.h: Add some assertionsKarl Williamson2015-09-041-4/+4
| | | | | | These will detect an array bounds error that occurs on EBCDIC machines, and by including the assert on non-EBCDIC, we verify that the code wouldn't fail when built on EBCDIC.
* utf8.h: Add comment; white space changesKarl Williamson2015-09-041-3/+5
|
* utf8.h: Change definition of UTF8_IS_INVARIANTKarl Williamson2015-09-041-5/+8
| | | | The new definition is simpler to read.
* Change EBCDIC macro definitionKarl Williamson2015-09-041-10/+6
| | | | | | This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC platforms to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change EBCDIC macro definitionKarl Williamson2015-09-041-14/+14
| | | | | | | | Prior to this commit UVCHR_SKIP() was defined the same way on both ASCII and EBCDIC, but the definition expanded to different things on each. Now it is defined separately for each platform, as what it expands to, and the fully expanded EBCDIC version is changed to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change EBCDIC macro definitionKarl Williamson2015-09-041-6/+7
| | | | | | | | Prior to this commit UVCHR_IS_INVARIANT() was defined the same way on both ASCII and EBCDIC, but the definition expanded to different things on each. Now it is defined separately for each platform, as what it expands to, and the fully expanded EBCDIC version is changed to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* utf8.h: Change defn of UNI_IS_INVARIANTKarl Williamson2015-09-041-2/+2
| | | | | This changes it to be isASCII(), instead of repeating the "special" number 0x80.
* Change filter of problematic code points for EBCDICKarl Williamson2015-09-041-2/+5
| | | | | | | | | | | | | | There are three classes of problematic Unicode code points that may require special handling. Which code points are problematic is fairly complicated, requiring lots of branches. However, the smallest of them is 0xD800, which means that most code points in modern use are below them all, and a single test can be used to exclude just about everything likely to be encountered. The problem was that the way this test was done on EBCDIC caused way too many things to pass and have to be checked with the more complicated branches. The digits 0-9 and some capital letters were not filtered out, for example. This commit changes the EBCDIC test to transform into I8 (an array lookup), and this fixes it to exclude things that shouldn't have passed before.
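The filtering idea can be sketched for ASCII platforms as follows. This is only an illustration: the function names are hypothetical (they are not the macros in utf8.h), and it assumes the three problematic classes are surrogates, noncharacters, and code points above U+10FFFF. The EBCDIC variant described above would first transform to I8 via a table lookup, which is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: the smallest code point in any problematic class
 * (the first surrogate). */
#define FIRST_PROBLEMATIC_CP 0xD800u

static bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

static bool is_nonchar(uint32_t cp)
{
    /* U+FDD0..U+FDEF, plus the last two code points of every plane */
    return (cp >= 0xFDD0u && cp <= 0xFDEFu)
        || (cp <= 0x10FFFFu && (cp & 0xFFFEu) == 0xFFFEu);
}

static bool is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFFu;
}

static bool is_problematic(uint32_t cp)
{
    /* A single comparison excludes everything below the lowest
     * problematic value, which covers most text in practice. */
    if (cp < FIRST_PROBLEMATIC_CP)
        return false;

    /* Only the remaining code points pay for the branchier checks. */
    return is_surrogate(cp) || is_nonchar(cp) || is_above_unicode(cp);
}
```

With this shape, a call such as `is_problematic('A')` returns after the first comparison, while `is_problematic(0xFFFE)` falls through to the detailed checks.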
* Add macro for converting Latin1 to UTF-8, and use itKarl Williamson2015-09-041-0/+15
| | | | | | | | | This adds a macro that converts a code point in the ASCII 128-255 range to UTF-8, and changes existing code to use it when the range is known to be restricted to this one, rather than the previous macro which accepted a wider range (any code point representable by 2 bytes), but had an extra test on EBCDIC platforms, hence was larger than necessary and slightly slower.
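A minimal sketch of such a restricted conversion, for ASCII platforms only. The function name is hypothetical and is not the macro added by this commit; the EBCDIC case needs an extra translation step that is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode a code point in 128..255 as two UTF-8
 * bytes. Because the range is known in advance, no test for "does this
 * fit in two bytes" is needed beyond the debugging assertion. */
static void latin1_upper_to_utf8(uint8_t c, uint8_t out[2])
{
    assert(c >= 0x80);                      /* caller guarantees the range */
    out[0] = (uint8_t)(0xC0 | (c >> 6));    /* start byte: 0xC2 or 0xC3 */
    out[1] = (uint8_t)(0x80 | (c & 0x3F));  /* continuation byte */
}
```

For example, passing 0xE9 (LATIN SMALL LETTER E WITH ACUTE) stores the bytes 0xC3 0xA9 in `out`.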
* utf8.h: Add assertions to macroKarl Williamson2015-09-041-1/+3
|
* perlapi: Add some S<>Karl Williamson2015-09-031-3/+3
| | | | | so that these constructs appear on a single output line for reader convenience.
* utf8.h: Add macro synonymKarl Williamson2015-09-031-0/+3
| | | | | | | | | The new name is a more accurate description of what it does. I meant to apply this commit before d0664088be143e921b2e717524bafddf6a406029, but somehow it didn't happen. On EBCDIC platforms, that commit will fail to compile without this one, so could be a problem if ever bisecting on one.
* utf8.h: Add dummy param for when macros placed in APIKarl Williamson2015-08-011-7/+11
| | | | | | | | These macros are not in the public API, but they might be someday. We may want to check for valid UTF-8 at some point. Add a parameter so that is possible then without having to change the API. This also changes to use the short name of one macro.
* utf8.h: Fix typo in macro name definitionKarl Williamson2015-08-011-1/+1
| | | | The trailing underscore was unintended.
* utf8.h, utfebcdic.h: Add comments; white-space onlyKarl Williamson2015-08-011-7/+12
|
* utf8.h: Add UTF8_SKIP as a synonym for UTF8SKIPKarl Williamson2015-08-011-1/+2
| | | | | Most of the other names in utf8.h have an underscore; this allows someone to keep things consistent in their code.
* utf8.h: White-space onlyKarl Williamson2015-08-011-7/+7
| | | | Align some macro definitions vertically to make it easier to read
* Handle Unicode 3.0.1 /i Turkish "i" rulesKarl Williamson2015-07-281-0/+2
| | | | | | | Actually, there are no special rules for this Unicode release. All the 4 "i" characters are considered equivalent under /i only in this release. (Upper and lowercase dotted and dotless "i"). This adds special cases that are only compiled in for that release.
* Allow Perl to compile and work on Unicode releases without U+1E9EKarl Williamson2015-07-281-1/+3
| | | | | | | | | | | | U+1E9E LATIN CAPITAL LETTER SHARP S is handled specially by Perl, because of its relationship to the infamous LATIN SMALL LETTER SHARP S, which folds to 'ss', being the only character whose code point is < 256 to have a multi char fold, and this creates lots of special cases. But U+1E9E wasn't in all Unicode releases. Because Perl is supposed to work with any release, we need to be able to compile when this character isn't defined. In some of those cases we use U+017F (LATIN SMALL LETTER LONG S) instead, which is in all releases.
* perlapi: Add intro text to Unicode sectionKarl Williamson2015-05-071-0/+6
|
* perlapi: Document some functionsKarl Williamson2015-05-071-2/+30
| | | | | These are mentioned in some other pods. It's best to bring them into perlapi, and refer to them from the other pods.
* utf8.h: Add a #defineKarl Williamson2015-05-071-2/+3
| | | | | | The name UVCHR... parallels the usage of various functions uvchr... It's less confusing to keep the same name form for the same type of functionality
* Replace common Emacs file-local variables with dir-localsDagfinn Ilmari Mannsåker2015-03-221-6/+0
| | | | | | | | | | | | | | | | An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN. Committer's work: To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools. perldelta entry for module updates. Add two Emacs control files to MANIFEST; re-sort MANIFEST. For: RT #124119.
* fix assertions for UTF8_TWO_BYTE_HI/LOHugo van der Sanden2015-02-121-3/+3
| | | | | | Replace the stricter MAX_PORTABLE_UTF8_TWO_BYTE check with a looser MAX_UTF8_TWO_BYTE check, else we can't correctly convert codepoints in the range 0x400-0x7ff from utf16 to utf8 on non-ebcdic platforms.
* foldEQ_utf8(): Add some internal flagsKarl Williamson2014-12-291-0/+2
| | | | The comments explain their purpose
* Make is_invariant_string()Karl Williamson2014-11-261-1/+14
| | | | | | This is a more accurately named synonym for is_ascii_string(), which is retained. The old name is misleading to someone programming for non-ASCII platforms.
* utf8.h: EBCDIC fixKarl Williamson2014-10-211-2/+2
| | | | | | | These macros are supposed to accommodate larger than a byte inputs. Therefore, under EBCDIC, we have to use a different macro which handles the larger values. On ASCII platforms, these called macros are no-ops so it doesn't matter there.
* Add and use macros for case-insensitive comparisonKarl Williamson2014-08-221-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | This adds to handy.h isALPHA_FOLD_EQ(c1,c2) which efficiently tests if c1 and c2 are the same character, case-insensitively. For example isALPHA_FOLD_EQ(c, 's') returns true if and only if <c> is 's' or 'S'. isALPHA_FOLD_NE() is also added by this commit. At least one of c1 and c2 must be known to be in [A-Za-z] or this macro doesn't work properly. (There is an assert for this in the macro in DEBUGGING builds). That is why the name includes "ALPHA", so you won't forget when using it. This functionality has been in regcomp.c for a while, under a different name. I had thought that the only reason to make it more generally available was potential speed gain, but recent gcc versions optimize to the same code, so I thought there wasn't any point to doing so. But I now think that using this makes things easier to read (and certainly shorter to type in). Once you grok what this macro does, it simplifies what you have to keep in your mind when reading logical expressions with multiple operands. That something can be either upper or lower case can be a distraction to understanding the larger point of the expression.
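The comparison described above can be sketched as a function. The names below are hypothetical stand-ins, not the handy.h macros, and the sketch assumes an ASCII platform, where upper and lower case letters differ in exactly one bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true for [A-Za-z] on ASCII platforms. */
static bool is_ascii_alpha(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Hypothetical function form of the case-insensitive comparison. */
static bool alpha_fold_eq(int c1, int c2)
{
    const int case_bit = 'A' ^ 'a';   /* 0x20 on ASCII */

    /* At least one operand must be a letter, otherwise masking the
     * case bit can make unrelated characters compare equal. */
    assert(is_ascii_alpha(c1) || is_ascii_alpha(c2));

    return (c1 & ~case_bit) == (c2 & ~case_bit);
}
```

The precondition matters: '@' (0x40) and '`' (0x60) differ only in the case bit, so without the requirement that one operand be a letter they would wrongly compare equal.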
* utf8.h: Add commentKarl Williamson2014-07-091-1/+3
|
* perlapi: Refactor placements, headings of some functionsKarl Williamson2014-06-051-7/+0
| | | | | | | | | | | | | | It is not very user friendly to list functions as "Functions found in file FOO". Better is to group them by purpose, as many were already. I went through and placed the ones that weren't already so grouped into groups. Patches welcome if you have a better classification. I changed the headings of some so that the important distinction was the first word so that they are placed in the file more appropriately. And for a couple that I had created myself, I came up with a name that I think is better than the original.
* Add parameters to "use locale"Karl Williamson2014-06-051-2/+5
| | | | | | | This commit allows one to specify to enable locale-awareness for only a specified subset of the locale categories. Thus you could make a section of code LC_MESSAGES aware, with no locale-awareness for the other categories.
* Fix definition of toCTRL() for EBCDICKarl Williamson2014-05-311-0/+4
| | | | | | The definition was incorrect. When going from control to printable name, we need to go from Latin1 -> Native, so that e.g., a 65 gets turned into the native 'A'
* Add some (UN)?LIKELY() to UTF8 handlingKarl Williamson2014-05-311-3/+3
| | | | | It's very rare actually for code to be presented with malformed UTF-8, so give the compiler a hint about the likely branches.
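A sketch of how such hints are commonly written. The macro and function names are hypothetical, and `__builtin_expect` is a GCC/Clang builtin, so the sketch falls back to the plain expression on other compilers.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define MY_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define MY_LIKELY(x)   (x)
#  define MY_UNLIKELY(x) (x)
#endif

/* Hypothetical example: count the bytes that are not continuation
 * bytes, or return -1 if the buffer begins with a continuation byte. */
static long count_start_bytes(const unsigned char *s, size_t len)
{
    long n = 0;
    size_t i;

    /* Malformed input is rare, so mark this branch as unlikely. */
    if (MY_UNLIKELY(len > 0 && (s[0] & 0xC0) == 0x80))
        return -1;

    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The hints do not change results; they only tell the compiler which branch to lay out as the fall-through path.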
* Make is_utf8_char_buf() a macroKarl Williamson2014-05-311-0/+2
| | | | | | This function is now more efficiently implemented as a synonym for isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code that calls it that way.
* utf8.h: Use new macro type from previous commitKarl Williamson2014-05-311-35/+25
| | | | | | | | This allows for an efficient isUTF8_CHAR macro, which does its own length checking, and uses the UTF8_INVARIANT macro for the first byte. On EBCDIC systems this macro which does a table lookup is quite a bit more efficient than all the branches that would normally have to be done.
* Create isUTF8_CHAR() macro and use itKarl Williamson2014-05-311-13/+39
| | | | | | | | | | | | | | | | | | This macro will inline the code to determine if a character is well-formed UTF-8 for code points below a certain value, falling back to a slower function for larger ones. On ASCII platforms, it will inline for well-beyond all legal Unicode code points. On EBCDIC, it currently does it for code points up to 0x3FFF. This could be increased, but our porting tests do the regen every time to make sure everything is ok, and making it larger slows that down. This is worked around on ASCII by normally commenting out the code that generates this info, but including in utf8.h a version that did get generated. This is static information and won't change. (This could be done for EBCDIC too, but I chose not to at this time as each code page has a different macro generated, and it gets ugly getting all of them in utf8.h) Using this macro allowed for simplification of several functions in utf8.c
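The inline-then-fall-back structure can be sketched like this. Everything here is an assumption made for illustration: the names are hypothetical, the inline cutoff (1- and 2-byte characters) is arbitrary and much lower than the limits mentioned above, and the slow path checks only structure, overlong forms, and the U+10FFFF ceiling, which may differ from what Perl itself accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slow path (hypothetical): handles 3- and 4-byte sequences.
 * Requires s < e. Returns the length, or 0 if not well-formed. */
static size_t utf8_char_len_slow(const uint8_t *s, const uint8_t *e)
{
    size_t len, i;
    uint32_t cp;

    if ((s[0] & 0xF0) == 0xE0)      { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid start */

    if ((size_t)(e - s) < len)
        return 0;                   /* sequence runs past the buffer */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings and values above U+10FFFF. */
    if (cp < (len == 3 ? 0x800u : 0x10000u) || cp > 0x10FFFFu)
        return 0;
    return len;
}

/* Fast path: does its own length check and settles the common short
 * cases inline, deferring everything else to the function above. */
static inline size_t utf8_char_len(const uint8_t *s, const uint8_t *e)
{
    if (e <= s)
        return 0;
    if (s[0] < 0x80)
        return 1;                   /* invariant: one byte */
    if (s[0] >= 0xC2 && s[0] <= 0xDF)
        return (e - s >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    return utf8_char_len_slow(s, e);
}
```

A caller can then advance through a buffer with `s += utf8_char_len(s, e)`, treating a return of 0 as malformed input.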
* utf8.h: Move macro within fileKarl Williamson2014-05-311-7/+8
| | | | This places it in a better situated spot for later commits
* regen/regcharclass.pl: Update to use EBCDIC utilitiesKarl Williamson2014-05-311-1/+1
| | | | | This causes the generated regcharclass.h to be valid on all supported platforms
* White-space, comments onlyKarl Williamson2014-01-271-1/+1
| | | | | | | This mostly indents and outdents based on blocks added or removed by the previous commit. But there are a few comment changes and vertical alignment of macro backslash continuation characters, and other white-space changes.
* Rename an internal flagKarl Williamson2014-01-271-1/+1
| | | | | The UTF8 in the name is kind of misleading, and would be more misleading after future commits make UTF8 locales special.
* Taint more operands with case changesKarl Williamson2014-01-271-5/+4
| | | | | | | | | | The documentation says that Perl taints certain operations when subject to locale rules, such as lc() and ucfirst(). Prior to this commit there were exceptions when the operand to these functions contained no characters whose case change actually varied depending on the locale, for example the empty string or above-Latin1 code points. Changing to conform to the documentation simplifies the core code, and yields more consistent results.
* Change some warnings in utf8n_to_uvchr()Karl Williamson2014-01-011-1/+3
| | | | | | | | | | | | | | | | This bottom level function decodes the first character of a UTF-8 string into a code point. Using it directly is discouraged. This commit cleans up some of the warnings it can raise. Now, tests for malformations are done before any tests for other potential issues. One of those issues involves code points so large that they have never appeared in any official standard (the current standard has scaled back the highest acceptable code point from earlier versions). It is possible (though not done in CPAN) to warn and/or forbid these code points, while accepting smaller code points that are still above the legal Unicode maximum. The warning message for this now includes the code point if representable on the machine. Previously it always displayed raw bytes, which is what it still does for non-representable code points.
* Move a macro from utf8.h to handy.h for wider use.Karl Williamson2014-01-011-10/+0
| | | | Future commits will want this available outside utf8.h
* utf8.h: Add parameter checking to some macros in DEBUGGING buildsKarl Williamson2013-12-061-23/+51
| | | | | | This change should catch some wrong calls to these macros. The meat of the macros is extracted out into two internal-only macros, and the other macros are rearranged to call these.
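One way to structure such checking is sketched below. All macro names are hypothetical, and `MY_DEBUGGING` stands in for whatever symbol the build defines for debugging builds; it is defined here only so the sketch exercises the checked variant.

```c
#include <assert.h>

#define MY_DEBUGGING 1   /* assumption: normally set by the build system */

/* Expands to "assert(cond)," in debugging builds so it can be placed in
 * front of an expression, and to nothing otherwise. */
#ifdef MY_DEBUGGING
#  define MY_ASSERT_(cond) assert(cond),
#else
#  define MY_ASSERT_(cond)
#endif

/* Internal-only macros holding the actual computation. */
#define MY_TWO_BYTE_HI_nocheck(c) (0xC0 | ((c) >> 6))
#define MY_TWO_BYTE_LO_nocheck(c) (0x80 | ((c) & 0x3F))

/* Public macros: check the parameter, then call the internal macro.
 * 0x7FF is the largest code point that fits in two UTF-8 bytes. */
#define MY_TWO_BYTE_HI(c) \
    (MY_ASSERT_((c) <= 0x7FF) (unsigned char) MY_TWO_BYTE_HI_nocheck(c))
#define MY_TWO_BYTE_LO(c) \
    (MY_ASSERT_((c) <= 0x7FF) (unsigned char) MY_TWO_BYTE_LO_nocheck(c))
```

Because the assertion sits on the left of a comma operator, the macro is still usable as an expression, and non-debugging builds compile the check away entirely.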
* utf8.h: Fix grammar in commentKarl Williamson2013-12-041-2/+2
|
* utf8.h: White-space onlyKarl Williamson2013-09-301-1/+2
| | | | I believe this makes the macro easier to read
* The choice of 7 or 13 byte extended UTF-8 should be based on UVSIZE.Nicholas Clark2013-09-171-5/+3
| | | | Previously it was based on HAS_QUAD, which is not (as) correct.
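A sketch of keying the limit to the integer size. The names are hypothetical, and `MY_UVSIZE` stands in for the configured size in bytes of the widest unsigned integer type; it is hard-coded to 8 here purely for illustration.

```c
#include <assert.h>

#define MY_UVSIZE 8   /* assumption: would normally come from configuration */

/* A 7-byte sequence carries 36 payload bits, enough for any 32-bit
 * value; 64-bit values need the 13-byte form. */
#if MY_UVSIZE < 8
#  define MY_UTF8_MAXBYTES 7
#else
#  define MY_UTF8_MAXBYTES 13
#endif
```

Basing the choice on the size of the type that holds code points ties the buffer limit directly to the largest value that can actually be encoded.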
* Use separate macros for byte vs uv UnicodeKarl Williamson2013-09-101-1/+6
| | | | | | | This removes a macro not yet even in a development release, and splits its calls into two classes: those where the input is a byte; and those where it can be any unsigned integer. The byte implementation avoids a function call on EBCDIC platforms.
* PATCH: [perl #119601] Bleadperl breaks ETHER/Devel-DeclareKarl Williamson2013-09-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | I will not otherwise mention that stealing .c code from the core is a dangerous practice. This is actually a bug in the module, which had been masked until now. The first two parameters to utf8_to_uvchr_buf() are both U8*. But both 's' and PL_bufend are char*. The 's' has a cast to U8* in the failing line, but not PL_bufend. Interestingly, the line in the official toke.c (introduced in 4b88fb76) has always been right, so the stealer didn't copy it correctly. What de69f3af3 did was turn this former function call into a macro that manipulates the parameters and calls another function, thereby removing a layer of function call overhead. The manipulation involves subtracting 's' from PL_bufend, and this fails to compile due to the missing cast on the latter parameter. The problem goes away if the macro casts both parameters to U8*, and that is what this commit does.
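The fix can be illustrated with a small sketch. The typedef, helper function, and macro name are hypothetical stand-ins for the real ones; the point is only that the macro casts both parameters before subtracting them.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char U8;   /* stand-in for the core typedef */

/* Hypothetical helper standing in for the function the macro calls;
 * it just returns the first byte of a non-empty buffer. */
static unsigned long first_byte(const U8 *s, ptrdiff_t len)
{
    assert(len > 0);
    return s[0];
}

/* Casting both parameters lets callers pass char* or U8* for either
 * one. Without the cast on (e), subtracting a U8* from a char* does
 * not compile. */
#define FIRST_BYTE_BUF(s, e) \
    first_byte((const U8 *)(s), (const U8 *)(e) - (const U8 *)(s))
```

With these casts, `FIRST_BYTE_BUF((const U8 *)buf, buf + 2)` compiles even when `buf` is a `const char *`, which is the mixed-type situation described above.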
* utf8.h: White space onlyKarl Williamson2013-08-291-6/+7
| | | | Vertically align the definitions of a few #defines