path: root/utf8.h
Commit message (Collapse)AuthorAgeFilesLines
* Refactor UTF_START_MARK()Karl Williamson2021-05-301-5/+10
| | | | | This allows the removal of a conditional in a very low level (called a lot) macro
* UTF8_IS_NEXT_CHAR_DOWNGRADEABLE() check before derefKarl Williamson2021-05-291-2/+2
| | | | Reorder the clauses to check first before dereferencing
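A minimal sketch of the "check first, then dereference" ordering described above. The function names are hypothetical, and the byte tests assume standard (ASCII-platform) UTF-8, where code points U+0080..U+00FF start with byte 0xC2 or 0xC3; this is not the macro's actual definition.

```c
#include <stddef.h>

/* Hypothetical helpers, assuming standard UTF-8 (not UTF-EBCDIC). */
static int is_downgradeable_start(unsigned char b)
{
    return (b & 0xFE) == 0xC2;      /* 0xC2 or 0xC3 */
}

static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;      /* 10xxxxxx */
}

/* The length test comes first; && short-circuits, so s[0] and s[1]
 * are only read once at least two bytes are known to be available. */
static int next_char_downgradeable(const unsigned char *s,
                                   const unsigned char *e)
{
    return (e - s > 1)
        && is_downgradeable_start(s[0])
        && is_continuation(s[1]);
}
```

With the clauses in the opposite order, a call with `s == e` would read a byte outside the valid range before the length test had a chance to reject it.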
* utf8.h: Simplify UNICODE_IS_SURROGATE()Karl Williamson2021-05-281-4/+3
| | | | | This uses inRANGE() with mnemonics to make it clearer with no increase in the number of conditionals
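The single-conditional range test can be sketched as follows. This is an illustration only, not the definition in handy.h or utf8.h: `in_range`, `is_surrogate`, and the two bound names are hypothetical, and only the surrogate range U+D800..U+DFFF is taken from the Unicode standard.

```c
#include <stdint.h>

/* Hypothetical stand-in for inRANGE(): after subtracting the lower
 * bound, values below it wrap around to very large unsigned numbers,
 * so one comparison covers both ends of the range. */
static inline int in_range(uint32_t c, uint32_t lo, uint32_t hi)
{
    return (uint32_t)(c - lo) <= (uint32_t)(hi - lo);
}

/* Mnemonic bounds of the Unicode surrogate block. */
#define FIRST_SURROGATE 0xD800u
#define LAST_SURROGATE  0xDFFFu

static inline int is_surrogate(uint32_t cp)
{
    return in_range(cp, FIRST_SURROGATE, LAST_SURROGATE);
}
```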
* utf8.h: Use inRANGE for UNICODE_IS_32_CONTIGUOUS_NONCHARSKarl Williamson2021-05-281-2/+2
| | | | This leads to a single conditional instead of two.
* utf8.h: Refactor UNICODE_IS_NONCHAR()Karl Williamson2021-05-281-3/+3
| | | | | | | | | | This adds branch prediction and re-orders so that an unlikely to succeed test is done before the likely to succeed one, so that the latter usually doesn't need to be executed. Since both conditions must succeed for the entire expression to succeed, this doesn't change what the whole expression matches.
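The reordering can be illustrated with a small sketch. All names here are hypothetical, `__builtin_expect` is a GCC/Clang extension (with a plain fallback for other compilers), and the non-character ranges are those defined by Unicode; the real macro in utf8.h may be structured differently.

```c
#include <stdint.h>

/* Hypothetical branch-prediction hints. */
#if defined(__GNUC__)
#  define MY_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define MY_LIKELY(x)   (x)
#  define MY_UNLIKELY(x) (x)
#endif

static inline int is_nonchar(uint32_t cp)
{
    /* The 32 contiguous non-characters U+FDD0..U+FDEF, tested with a
     * single unsigned comparison. */
    if (MY_UNLIKELY((uint32_t)(cp - 0xFDD0u) <= 0xFDEFu - 0xFDD0u))
        return 1;

    /* The rarely true test (low 16 bits are FFFE or FFFF) comes first.
     * The almost always true upper-bound test is evaluated only when
     * the first one succeeds, so it is usually skipped. */
    return MY_UNLIKELY((cp & 0xFFFEu) == 0xFFFEu)
        && MY_LIKELY(cp <= 0x10FFFFu);
}
```

Because both clauses are joined by `&&`, swapping them changes only how often the second one runs, not the set of code points matched.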
* style: Detabify indentation of the C code maintained by the core.Michael G. Schwern2021-01-171-2/+2
| | | | | | | | | | | This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue. Done with `expand -i`. * vutil.* left alone, it's part of version. * Left regen managed files alone for now.
* utf8.h: Fix syntax error only found on EBCDIC buildsKarl Williamson2020-12-041-1/+1
|
* autodoc.pl: Specify scn for single-purpose filesKarl Williamson2020-11-061-8/+0
| | | | | | | | Many of the files in perl are for one thing only, and hence their embedded documentation will be for that one thing. By creating a hash here of them, those files don't have to worry about what section that documentation goes under, and so it can be completely changed without affecting them.
* Change some link pod for better renderingKarl Williamson2020-08-311-7/+7
| | | | C<L</foo>> renders better in places than L</C<foo>>
* Document ibcmp_utf8, and move to like-fcns hdrKarl Williamson2020-08-221-3/+0
|
* utf8.h: Add commentKarl Williamson2020-07-311-0/+1
|
* utf8.h: Remove obsolete macroKarl Williamson2020-07-301-7/+0
| | | | | | It turns out that this macro would have failed to compile since commit 538b546eb0f252250a30c08e6af47d0ea7433fa1, in October 2013. So it is clear no one is using it.
* Fix typo when using nBIT_UMAXNicolas R2020-07-221-1/+1
| | | | | | | | nBIT_MAX was used instead of nBIT_UMAX in the changes from d223e1ea9ae864c0e563187f1e76. Note: at first glance it seems that nBIT_UMAX is an alias for nBIT_MASK.
* utf8.h: Add some branch predictionsKarl Williamson2020-07-171-20/+26
| | | | | It is likely that the data will be well-formed Unicode, and not one of its special characters, like surrogates or non-characters, nor NUL.
* handy.h: Create nBIT_MASK(n) macroKarl Williamson2020-07-171-2/+2
| | | | | This encapsulates a common paradigm, making sure that it is done correctly for the platform's size.
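One reason such a mask is easy to get wrong is that the naive `(1 << n) - 1` has undefined behavior when `n` equals the width of the type. The sketch below shows one way to handle that case. The function `n_bit_mask` is hypothetical, and the actual nBIT_MASK definition in handy.h is not reproduced here.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: a mask with the low n bits set. */
static inline uintmax_t n_bit_mask(unsigned n)
{
    const unsigned width = (unsigned)(sizeof(uintmax_t) * CHAR_BIT);

    /* Shifting by the full width is undefined, so handle it apart. */
    if (n >= width)
        return UINTMAX_MAX;

    /* The cast makes the shift happen in the wide type, not in int. */
    return ((uintmax_t)1 << n) - 1;
}
```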
* Fix a bunch of repeated-word typosDagfinn Ilmari Mannsåker2020-05-221-1/+1
| | | | | Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* pv_uni_display: Use common fcn; \b mnemonicKarl Williamson2020-01-231-1/+7
| | | | | | | | | | | This removes the (almost) duplicate code in this function to display mnemonics for control characters that have them. The reason the two pieces of code aren't precisely the same is that the other function also uses \b as a mnemonic for backspace. Using all possible mnemonics is desirable, so a flag is added for pv_uni_display to now use \b. This is now by default enabled in double-quoted strings, but not regex patterns (as \b there means something quite different except in character classes). B.pm is changed to expect \b.
* Fix UTF8_IS_START on EBCDICKarl Williamson2019-12-071-3/+11
|
* utf8.h: Rmv obsolete macrosKarl Williamson2019-11-241-16/+0
| | | | | | | These macros were missed in dd1a3ba7882ca70c1e85b0fd6c03d07856672075 and 059703b088f44d5665f67fba0b9d80cad89085fd. Using them would cause things to fail to compile.
* Add missing back compat macrosKarl Williamson2019-11-241-0/+1
| | | | These are needed only to allow some modules to stay updated with blead.
* utf8.h: Use MAX() macro instead of its expansionKarl Williamson2019-11-141-3/+1
| | | | It makes things a little clearer.
* utf8.h: Use a cast to U8 to avoid an ANDKarl Williamson2019-11-111-1/+1
|
* Allow core to work with code points above IV_MAXKarl Williamson2019-11-061-0/+4
| | | | | Higher has been reserved for core use, and a future commit will want to finally do this.
* Consolidate uses of PERL_SMALL_MACRO_BUFFERKarl Williamson2019-10-311-17/+2
| | | | | | | At the moment the _ASSERT_() is the one which has been showing large expansions. Change so it doesn't do anything if PERL_SMALL_MACRO_BUFFER is defined. That means various other calls that use PERL_SMALL_MACRO_BUFFER can be simplified to not use it.
* Document UTF8_MAXBYTES_CASEKarl Williamson2019-10-151-8/+18
|
* Fix UTF8_CHK_SKIP()Karl Williamson2019-10-131-1/+2
| | | | I forgot an arg in a macro it calls.
* Add UTF8_CHK_SKIP() macroKarl Williamson2019-10-091-2/+40
| | | | | | | | | | | | | This is a safer version of UTF8SKIP for use when the input could be possibly malformed. It uses strnlen() to not read past a NUL in the input. Since Perl adds NULs to the end of SV's, this will likely prevent reading beyond the end of a buffer. A still safer version could be written that doesn't look for just a NUL, but any unexpected byte, and stops just before that. I suspect that is overkill, and since strnlen() can be very fast, I went with this approach instead. Nothing precludes adding another version that does this full checking.
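The idea of capping the skip at the first NUL can be sketched as below. This is not the macro from utf8.h. The function names are hypothetical, the length table covers only standard UTF-8 sequences of up to four bytes, `memchr()` is swapped in for `strnlen()` to stay within standard C, and returning 1 when `s` points at a NUL is a choice made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: sequence length claimed by a first byte, assuming
 * standard UTF-8. Bytes that cannot start a sequence count as 1. */
static size_t claimed_len(unsigned char b)
{
    if (b < 0xC0) return 1;     /* ASCII or stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 1;
}

/* Hypothetical: like a plain skip, but never advances past a NUL. */
static size_t chk_skip(const char *s)
{
    if (*s == '\0')
        return 1;               /* the NUL is itself a one-byte character */

    size_t claimed = claimed_len((unsigned char)*s);

    /* Bounded search: memchr stops at the first NUL, which must occur
     * within the first `claimed` bytes if the sequence is truncated. */
    const char *nul = memchr(s, '\0', claimed);
    return nul ? (size_t)(nul - s) : claimed;
}
```

For a truncated three-byte sequence such as `"\xE2\x82"`, the first byte claims three bytes but the NUL terminator is found at offset 2, so the skip is 2 rather than 3.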
* Document UTF8_SKIP()Karl Williamson2019-10-091-0/+8
|
* Make defn of UTF_IS_CONTINUED commonKarl Williamson2019-10-061-6/+5
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UVCHR_IS_INVARIANT commonKarl Williamson2019-10-061-13/+13
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of OFFUNI_IS_INVARIANT commonKarl Williamson2019-10-061-8/+5
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UTF8_IS_DOWNGRADEABLE_START commonKarl Williamson2019-10-061-8/+7
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UTF_IS_ABOVE_LATIN1 commonKarl Williamson2019-10-061-6/+8
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UTF8_IS_START commonKarl Williamson2019-10-061-7/+10
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UTF8_IS_CONTINUATION commonKarl Williamson2019-10-061-6/+6
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UTF_CONTINUATION_MARK commonKarl Williamson2019-10-061-4/+6
| | | | This can be derived from other values, removing an EBCDIC dependency
* Make defn of UTF_IS_CONTINUATION_MASK commonKarl Williamson2019-10-061-3/+4
| | | | | This variable can be defined from the same base in both UTF-8 and UTF-EBCDIC, and doing so eliminates an EBCDIC dependency.
* utf8.h: Add commentKarl Williamson2019-10-061-1/+3
|
* utf8.h: Remove redundant castKarl Williamson2019-10-061-1/+1
| | | | The called macro does the cast already
* utf8.h: Make sure macros not called with a ptrKarl Williamson2019-10-061-8/+8
| | | | | By doing an '| 0' with a parameter in a macro expansion, a C compilation error will be generated if the macro is called with a pointer. This is free protection.
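A small sketch of the guard, using a hypothetical macro name. Bitwise OR requires integer operands in C, so `| 0` leaves an integer argument unchanged while a pointer argument is rejected by the compiler.

```c
#include <stdint.h>

/* Hypothetical macro expecting a code point or byte, not a pointer.
 * `(c) | 0` has no effect on an integer value. */
#define IS_ASCII_CP(c)  ((uint32_t)((c) | 0) < 0x80u)

static int first_is_ascii(const unsigned char *p)
{
    /* Correct: pass the byte itself. */
    return IS_ASCII_CP(*p);

    /* Incorrect: IS_ASCII_CP(p) would not compile, because a pointer
     * is not a valid operand of the | operator. */
}
```

Since the OR with zero is folded away by the compiler, the check costs nothing at run time.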
* Remove deprecated character classification/case changing macrosKarl Williamson2019-09-291-9/+0
| | | | | | | | | | | | | | It has been deprecated since 5.26 to use various macros that deal with UTF-8 inputs but don't have a parameter indicating the maximum length beyond which we should not look. This commit changes all such macros, as threatened in existing documentation and warning messages, to have an extra parameter giving the length. This was originally scheduled to happen in 5.30, but was delayed because it broke some CPAN modules, and there wasn't really a good way around it. But now that Devel::PPPort 3.54 is out, ppport.h has new facilities for getting modules making these changes to work with older Perl releases.
* Change name of _utf8_to_uvchr_buf()Karl Williamson2019-09-151-1/+1
| | | | | A function name with a leading underscore is not legal in C. Instead add a suffix to differentiate this name from an otherwise identical one.
* Strip leading underscore from is_utf8_char_helper()Karl Williamson2019-09-151-3/+3
| | | | Names with a leading underscore are reserved for the C implementers
* Document UTF8_MAXBYTESKarl Williamson2019-09-021-5/+13
|
* perlapi: Document UNICODE_REPLACEMENTKarl Williamson2019-09-021-0/+5
|
* perlapi: Document NATIVE_to_foo, converse functionsKarl Williamson2019-09-021-3/+45
|
* utf8_to_uvchr_buf() make behavior match docsKarl Williamson2019-07-011-3/+1
| | | | | | | | | | | For well formed input, there is no change. But for malformed it wasn't returning the documented length when warnings were enabled, and not always the documented value when they were disabled. This is implemented as an inline function, called from both the macro and the Perl_ form. Devel::PPPort has sufficient tests for this.
* PATCH: [perl #133896] Assertion failureKarl Williamson2019-04-051-4/+7
| | | | | | | This was due to UTF8_SAFE_SKIP(s, e) not allowing s to be as large as e, and there are legitimate cases where it can be. This commit hardens the macro so that it never reads above e-1, returning 0 if it otherwise would be required to. The assertion is changed to 's <= e'.
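The hardened behavior described above can be sketched as a function. The names are hypothetical and the length table assumes standard UTF-8 of up to four bytes; the real UTF8_SAFE_SKIP is a macro and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: sequence length claimed by a first byte. */
static size_t seq_len(unsigned char b)
{
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 1;
}

/* Hypothetical: skip one character without going beyond e, where e
 * points one past the last readable byte. */
static size_t safe_skip(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);             /* s == e is legitimate */

    if (s == e)
        return 0;               /* nothing to read; *e is never touched */

    size_t claimed = seq_len(*s);
    size_t avail = (size_t)(e - s);
    return claimed < avail ? claimed : avail;
}
```

Returning 0 for `s == e` means a loop of the form `s += safe_skip(s, e)` needs its own `s < e` test to terminate.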
* Add UTF8_SAFE_SKIP API macroKarl Williamson2019-03-131-0/+11
| | | | This version of UTF8SKIP refuses to advance beyond the end pointer
* Remove relics of regex swash useKarl Williamson2019-02-141-5/+0
| | | | | | | | | | | This removes the most obvious and easy things that are no longer needed since regexes no longer use swashes at all. tr/// continues, for the time being, to use swashes, so not all swash handling is removable now. But tr/// doesn't use inversion lists, and so a bunch of code is ripped out here. Other code could have been, but I did only the relatively easy stuff. The rest can be ripped out all at once when tr/// stops using swashes.