path: root/utf8.h
Commit message | Author | Age | Files | Lines
* Add utf8_to_utf16 | Karl Williamson | 2021-08-14 | 1 | -0/+4
* Improve utf16_to_utf8_reversed() | Karl Williamson | 2021-08-14 | 1 | -0/+5
    Instead of destroying the input by first swapping the bytes, this calls a base function with the order to use. The non-reverse function is changed to call the base function with the non-reversed order.
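The idea can be sketched in C as follows. This is only an illustration under stated assumptions, not the code from the commit: `read_unit`, `read_unit_be` and `read_unit_reversed` are hypothetical names. The shared base routine is told which byte of each 16-bit unit is the high one, so the reversed variant can read the input without modifying it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base routine: read 16-bit unit number `i` from `buf`.
 * The caller passes the offsets of the high and low byte within each pair. */
static uint16_t read_unit(const uint8_t *buf, size_t i,
                          size_t high_idx, size_t low_idx)
{
    assert(high_idx + low_idx == 1);   /* must be (0,1) or (1,0) */
    return (uint16_t)((buf[2 * i + high_idx] << 8) | buf[2 * i + low_idx]);
}

/* Non-reversed order: high byte first. */
static uint16_t read_unit_be(const uint8_t *buf, size_t i)
{
    return read_unit(buf, i, 0, 1);
}

/* Reversed order: low byte first. The input is left untouched. */
static uint16_t read_unit_reversed(const uint8_t *buf, size_t i)
{
    return read_unit(buf, i, 1, 0);
}
```

Both wrappers share one code path, so a fix to the base routine applies to either byte order.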
* Make macro isUTF8_CHAR_flags an inline fcn | Karl Williamson | 2021-08-14 | 1 | -39/+0
    This makes it use the fast DFA for this functionality.
* utf8.h: Comment changes | Karl Williamson | 2021-08-07 | 1 | -10/+21
* utf8.h: White space only | Karl Williamson | 2021-08-07 | 1 | -19/+19
* utf8.h: Refactor UTF8_IS_NONCHAR... | Karl Williamson | 2021-08-07 | 1 | -7/+6
    UTF8_IS_NONCHAR_GIVEN_THAT_NON_SUPER_AND_GE_PROBLEMATIC() is defined just for backward compatibility (though I don't think anyone uses it). Swap which macro is the base level that the other is defined in terms of.
* Refactor UTF8_IS_SUPER() | Karl Williamson | 2021-08-07 | 1 | -20/+14
    This uses macros recently introduced to remove an EBCDIC dependency and make the definition simpler. It now uses the DFA, which should speed up the non-edge case uses.
* utf8.h: Document some #defines | Karl Williamson | 2021-08-07 | 1 | -0/+37
    The reorganization in the previous commit revealed some undocumented public macros.
* utf8.h: Move some #defines around | Karl Williamson | 2021-08-07 | 1 | -116/+119
    This moves the defines for things like surrogates, non-character code points, etc. to a more logical order, with like adjacent to like, and before they are otherwise used in the file.
* utf8.h: Remove EBCDIC dependency | Karl Williamson | 2021-08-07 | 1 | -7/+10
    By generalizing a macro, we can make it serve both ASCII and EBCDIC.
* utf8.h: Add macros to calc UTF start byte, first cont | Karl Williamson | 2021-08-07 | 1 | -4/+41
    These two bytes are useful to know in some situations. This commit changes a couple such places to use the first macro.
* utf8.h: Reorder some preprocessor directives | Karl Williamson | 2021-08-07 | 1 | -14/+10
    This is just so that things are clearer to the reader.
* utf8.h: Add #define | Karl Williamson | 2021-08-07 | 1 | -1/+5
    UTF_MIN_CONTINUATION_BYTE is clearer for use in some contexts.
* utf8.h: Move all SKIP functions to be near each other | Karl Williamson | 2021-08-07 | 1 | -9/+8
    For convenient code reading.
* utf8.h: Add new #define for extended length UTF-8 | Karl Williamson | 2021-08-07 | 1 | -0/+1
    The previous commit added a convenient place to create a symbol to indicate that the UTF-8 on this platform includes Perl's nearly-double length extension. The platforms this isn't needed on are 32-bit ASCII ones. This symbol allows removing one place where EBCDIC need be considered, and future commits will use it as well.
* utf8.h: Refactor MAX_UTF8_TWO_BYTE | Karl Williamson | 2021-08-07 | 1 | -3/+11
    The previous commit removed a macro that the comments for this refer to in explaining its derivation. So use an alternative, that is actually clearer.
* Reimplement OFFUNISKIP | Karl Williamson | 2021-08-07 | 1 | -47/+23
    Now that previous commits have made it fast to find the position of the first set bit in a word, we can use a formula to find how many bytes the UTF-8 of that will occupy. This allows for simplification of this macro, removing several conditionals.
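A minimal C sketch of this kind of formula, assuming the classic UTF-8 layout of up to 6 bytes (not Perl's extended form, and not UTF-EBCDIC). The names `msb_pos` and `utf8_len` are hypothetical, and a portable loop stands in for the fast bit-position primitive. For 2 to 6 bytes, an n-byte sequence holds 5n+1 payload bits, so with p the 0-based position of the highest set bit, n is the ceiling of p/5, which is `(p + 4) / 5` in integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* 0-based position of the highest set bit of a non-zero value. */
static unsigned msb_pos(uint32_t v)
{
    unsigned pos = 0;
    assert(v != 0);
    while (v >>= 1)
        pos++;
    return pos;
}

/* Bytes needed to encode `cp` in classic UTF-8; valid up to 0x7FFFFFFF. */
static unsigned utf8_len(uint32_t cp)
{
    assert(cp <= 0x7FFFFFFFu);
    if (cp < 0x80)                    /* single-byte case is special */
        return 1;
    return (msb_pos(cp) + 4) / 5;     /* ceil(p / 5) for p >= 7 */
}
```

A single division replaces the chain of range comparisons that a table-like definition would need; only the one-byte case still needs its own test in this sketch.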
* utf8.h: Add macro to compute UV skip by its log2 | Karl Williamson | 2021-08-07 | 1 | -2/+30
    This macro will calculate at compile time, if passed a compile-time constant, how many UTF-8 bytes are required to represent the parameter. The macro is a helper which works fine except for edge cases, which a wrapper is needed to handle. The commit changes one instance to use this new macro.
* utf8.h: Rmv EBCDIC dependency | Karl Williamson | 2021-08-07 | 1 | -11/+59
    This moves a #define into the common code for ASCII and EBCDIC machines. It adds a bunch of comments about the value that I wish I hadn't had to figure out for myself.
* Rename internal macro and move to utf8.h | Karl Williamson | 2021-08-07 | 1 | -0/+2
    This macro has a corresponding, older, name for the non-UTF-8 case. It makes sense to use the same paradigm, and move the definitions together so that the comments for one don't have to be repeated for the other.
* utf8.h: Remove an EBCDIC dependency | Karl Williamson | 2021-08-07 | 1 | -2/+19
    A symbol introduced in a previous commit allows this internal macro to only need a single version, suitable for either EBCDIC or ASCII.
* utf8.h: Add symbol for easing EBCDIC handling | Karl Williamson | 2021-08-07 | 1 | -0/+6
    This is then used in regcomp.c to avoid an #ifdef EBCDIC.
* utf8.h: Make a bit of EBCDIC known to ASCII | Karl Williamson | 2021-08-07 | 1 | -4/+15
    This info is needed in one other place; doing it here means only specifying it once.
* utf8.h: Add a #define synonym | Karl Williamson | 2021-08-07 | 1 | -3/+9
    This is more clearly named for various uses in this file. It has an unwieldy length, but is unlikely to be used outside it.
* Refactor UTF_START_MASK() | Karl Williamson | 2021-08-07 | 1 | -5/+14
    A slight change to this very low level macro (hence called a lot) removes the need for a conditional, and causes it to work on single-byte UTF-8 characters on ASCII platforms. The definition is also moved to a more logical place in the file.
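One way such a conditional can be avoided is sketched below in C. This is an assumption about the general technique, not a copy of the macro in utf8.h, and `start_payload_mask` is a hypothetical name. Shifting `0xFF` right by the sequence length keeps the payload bits of the start byte plus, for multi-byte sequences, the separator bit just above them. That bit is always 0 in a well-formed start byte, so ANDing with the mask still yields the payload, and the same expression gives `0x7F` for a single-byte character.

```c
#include <assert.h>
#include <stdint.h>

/* Mask that extracts the payload bits from the first byte of a
 * UTF-8 sequence of `len` bytes (classic layout, 1..6 bytes). */
static uint8_t start_payload_mask(unsigned len)
{
    assert(len >= 1 && len <= 6);
    return (uint8_t)(0xFFu >> len);
}
```

For example, `0xC3 & start_payload_mask(2)` gives `0x03`, the top payload bits of U+00E9 when encoded as `C3 A9`.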
* utf8.h: Move macro to earlier in file | Karl Williamson | 2021-08-07 | 1 | -13/+13
    This is now defined before first use.
* UTF8_IS_DOWNGRADEABLE_START: Call less general helper | Karl Williamson | 2021-08-07 | 1 | -1/+1
    Future commits would otherwise make the expansion of this macro too complicated for some C compilers. Use a less general internal helper function to avoid that.
* Refactor UTF_START_MARK() | Karl Williamson | 2021-05-30 | 1 | -5/+10
    This allows the removal of a conditional in a very low level (called a lot) macro.
* UTF8_IS_NEXT_CHAR_DOWNGRADEABLE() check before deref | Karl Williamson | 2021-05-29 | 1 | -2/+2
    Reorder the clauses to check first before dereferencing.
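The pattern can be illustrated with a short C sketch. It is a simplified stand-in for the real macro, valid only for the ASCII (non-EBCDIC) byte values, and `next_char_fits_in_byte` is a hypothetical name. Because `&&` short-circuits, putting the length test first guarantees that neither byte is read when fewer than two bytes remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if the bytes in [s, e) begin with a two-byte UTF-8 sequence that
 * encodes a code point in 0x80..0xFF (start byte 0xC2 or 0xC3). */
static int next_char_fits_in_byte(const uint8_t *s, const uint8_t *e)
{
    assert(s <= e);
    return (e - s) >= 2               /* length check comes first */
        && (s[0] & 0xFE) == 0xC2      /* start byte is 0xC2 or 0xC3 */
        && (s[1] & 0xC0) == 0x80;     /* second byte is a continuation */
}
```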
* utf8.h: Simplify UNICODE_IS_SURROGATE() | Karl Williamson | 2021-05-28 | 1 | -4/+3
    This uses inRANGE() with mnemonics to make it clearer with no increase in the number of conditionals.
* utf8.h: Use inRANGE for UNICODE_IS_32_CONTIGUOUS_NONCHARS | Karl Williamson | 2021-05-28 | 1 | -2/+2
    This leads to a single conditional instead of two.
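A range test can be done with one comparison by using unsigned wraparound, as in the following C sketch. The helper `in_range` and the constant names are hypothetical and only illustrate the general trick; they are not the definitions in Perl's headers. If `c` is below `lo`, the subtraction wraps to a very large value, so a single `<=` rejects values on both sides of the range.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_SURROGATE 0xD800u
#define LAST_SURROGATE  0xDFFFu

/* True if lo <= c <= hi, using one comparison. */
static int in_range(uint32_t c, uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    return (c - lo) <= (hi - lo);     /* wraps when c < lo */
}

static int is_surrogate(uint32_t cp)
{
    return in_range(cp, FIRST_SURROGATE, LAST_SURROGATE);
}
```

Naming the two bounds also documents the range being tested.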
* utf8.h: Refactor UNICODE_IS_NONCHAR() | Karl Williamson | 2021-05-28 | 1 | -3/+3
    This adds branch prediction and re-orders so that an unlikely to succeed test is done before the likely to succeed one, so that the latter usually doesn't need to be executed. Since both conditions must succeed for the entire expression to succeed, this doesn't change what the whole expression matches.
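The reordering can be sketched in C like this. It assumes a GCC- or Clang-style `__builtin_expect` and falls back to a plain expression elsewhere; `MY_UNLIKELY` and `is_nonchar` are hypothetical names, not the macros in utf8.h. In the last line the rarely-true test comes first, so the usually-true range check is skipped for almost all input.

```c
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define MY_UNLIKELY(x) (x)
#endif

/* True for the Unicode non-characters: U+FDD0..U+FDEF, and the last two
 * code points of each plane (U+xxFFFE, U+xxFFFF) up to U+10FFFF. */
static int is_nonchar(uint32_t cp)
{
    if (MY_UNLIKELY(cp - 0xFDD0u <= 0xFDEFu - 0xFDD0u))
        return 1;
    /* Rare test first; the common "is within Unicode" test runs only
     * when the rare one has already succeeded. */
    return MY_UNLIKELY((cp & 0xFFFEu) == 0xFFFEu) && cp <= 0x10FFFFu;
}
```

Swapping the two operands of `&&` would give the same result, since both must hold for the whole expression to be true.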
* style: Detabify indentation of the C code maintained by the core. | Michael G. Schwern | 2021-01-17 | 1 | -2/+2
    This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue. Done with `expand -i`.
    * vutil.* left alone, it's part of version.
    * Left regen managed files alone for now.
* utf8.h: Fix syntax error only found on EBCDIC builds | Karl Williamson | 2020-12-04 | 1 | -1/+1
* autodoc.pl: Specify scn for single-purpose files | Karl Williamson | 2020-11-06 | 1 | -8/+0
    Many of the files in perl are for one thing only, and hence their embedded documentation will be for that one thing. By creating a hash here of them, those files don't have to worry about what section that documentation goes under, and so it can be completely changed without affecting them.
* Change some link pod for better rendering | Karl Williamson | 2020-08-31 | 1 | -7/+7
    C<L</foo>> renders better in places than L</C<foo>>
* Document ibcmp_utf8, and move to like-fcns hdr | Karl Williamson | 2020-08-22 | 1 | -3/+0
* utf8.h: Add comment | Karl Williamson | 2020-07-31 | 1 | -0/+1
* utf8.h: Remove obsolete macro | Karl Williamson | 2020-07-30 | 1 | -7/+0
    It turns out that this macro would have failed to compile since commit 538b546eb0f252250a30c08e6af47d0ea7433fa1, in October 2013. So it is clear no one is using it.
* Fix typo when using nBIT_UMAX | Nicolas R | 2020-07-22 | 1 | -1/+1
    nBIT_MAX was used instead of nBIT_UMAX in the changes from d223e1ea9ae864c0e563187f1e76.
    Note: at first glance it seems that nBIT_UMAX is an alias for nBIT_MASK.
* utf8.h: Add some branch predictions | Karl Williamson | 2020-07-17 | 1 | -20/+26
    It is likely that the data will be well-formed Unicode, and not one of its special characters, like surrogates or non-characters, nor NUL.
* handy.h: Create nBIT_MASK(n) macro | Karl Williamson | 2020-07-17 | 1 | -2/+2
    This encapsulates a common paradigm, making sure that it is done correctly for the platform's size.
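The pitfall such a helper guards against is that a bare `(1 << n) - 1` is computed in `int`, which is undefined once `n` reaches the width of `int`. A minimal C sketch, assuming a 64-bit result type and using the hypothetical name `low_bits_mask` (this is not the definition in handy.h):

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `n` bits set, for 0 <= n <= 64. */
static uint64_t low_bits_mask(unsigned n)
{
    assert(n <= 64);
    if (n == 64)                      /* a 64-bit shift by 64 is undefined */
        return UINT64_MAX;
    return (UINT64_C(1) << n) - 1;    /* shift is done in 64 bits */
}
```

Keeping the width handling in one place means callers cannot accidentally shift a narrower constant.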
* Fix a bunch of repeated-word typos | Dagfinn Ilmari Mannsåker | 2020-05-22 | 1 | -1/+1
    Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* pv_uni_display: Use common fcn; \b mnemonic | Karl Williamson | 2020-01-23 | 1 | -1/+7
    This removes the (almost) duplicate code in this function to display mnemonics for control characters that have them. The reason the two pieces of code aren't precisely the same is that the other function also uses \b as a mnemonic for backspace. Using all possible mnemonics is desirable, so a flag is added for pv_uni_display to now use \b. This is now by default enabled in double-quoted strings, but not regex patterns (as \b there means something quite different except in character classes). B.pm is changed to expect \b.
* Fix UTF8_IS_START on EBCDIC | Karl Williamson | 2019-12-07 | 1 | -3/+11
* utf8.h: Rmv obsolete macros | Karl Williamson | 2019-11-24 | 1 | -16/+0
    These macros were missed in dd1a3ba7882ca70c1e85b0fd6c03d07856672075 and 059703b088f44d5665f67fba0b9d80cad89085fd. Using them would cause things to fail to compile.
* Add missing back compat macros | Karl Williamson | 2019-11-24 | 1 | -0/+1
    These are needed only to allow some modules to stay updated with blead.
* utf8.h: Use MAX() macro instead of its expansion | Karl Williamson | 2019-11-14 | 1 | -3/+1
    It makes things a little clearer.
* utf8.h: Use a cast to U8 to avoid an AND | Karl Williamson | 2019-11-11 | 1 | -1/+1
* Allow core to work with code points above IV_MAX | Karl Williamson | 2019-11-06 | 1 | -0/+4
    Higher has been reserved for core use, and a future commit will want to finally do this.