path: root/utf8.h
Commit message | Author | Age | Files | Lines
* utf8.h: Add comment | Karl Williamson | 2023-03-28 | 1 | -2/+2
* utf8.h: Rmv improper (STRLEN) cast | Karl Williamson | 2023-02-26 | 1 | -1/+1
  An enum is supposed to be an int. STRLEN may be too large for that. (I suspect the cast was inadvertently left in converting from an earlier implementation that was never released.) Spotted by H. Merijn Brand.
* Add utf8ness_t typedef | Karl Williamson | 2022-08-22 | 1 | -0/+77
  This will be used in future commits
* utf8.h: Add unsigned cast to array index parameter | Karl Williamson | 2022-06-02 | 1 | -1/+1
  this makes sure it is never negative
* utf8.h: fixup signed/unsigned warning from UTF8_IS_REPLACEMENT | Yves Orton | 2022-03-03 | 1 | -3/+4
  Added a SSize_t cast to the sizeof() call.
* Add ASSERT_IS_PTR macro | Karl Williamson | 2022-03-03 | 1 | -1/+1
  To make sure at compile time that its argument is a ptr
* Fix UTF8_IS_REPLACEMENT() and add tests | Karl Williamson | 2022-03-01 | 1 | -1/+4
  This macro relied on another macro that was removed in 2019 (216dc346ceeeb9b6ba0fdd470ccfe4f8b2a286c4). Fix it and add tests. This bug was caught by Devel::PPPort.
* utf8.h: Rmv redundant asserts | Karl Williamson | 2021-09-16 | 1 | -8/+4
  These macros asserted both that the passed-in parameter occupied no more than a byte, and that it wasn't a pointer. But pointers occupy more than a byte, so if it passes the first check, meaning it occupies only a byte, it will necessarily pass the second, making that check unnecessary.
* Make paradigm into a macro | Karl Williamson | 2021-08-25 | 1 | -10/+12
  These macros use (x) | 0 to get a compiler error if x is a pointer rather than a value. This was instituted because there was confusion in them as to what they were called with. But the purpose of the paradigm wasn't obvious to even some experts; it was documented in every file in which it was used, but not at every occurrence. And, not every compiler can cope with them, it turns out. Making the paradigm into a macro, which this commit does, makes the uses self-documenting, albeit at the expense of cluttering up the macro definition somewhat; and allows the mechanism to be turned off if necessary for some compilers. Since it will be enabled for the majority of compilers, the potential bugs will be caught anyway.
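  The (x) | 0 paradigm described in this commit can be sketched roughly as below. This is only an illustration: the names MY_ASSERT_NOT_PTR, MY_NO_PTR_CHECK and MY_IS_ASCII_BYTE are hypothetical, and the macro actually added by the commit is not shown in this log.

```c
#include <assert.h>

/* Hypothetical names throughout; not taken from utf8.h.
 * Bitwise OR is not defined for pointer operands in C, so applying
 * "| 0" to a pointer is a compile-time error, while for an integer
 * value it leaves the value unchanged. */
#ifndef MY_NO_PTR_CHECK              /* hypothetical off-switch for compilers that cannot cope */
#  define MY_ASSERT_NOT_PTR(x) ((x) | 0)
#else
#  define MY_ASSERT_NOT_PTR(x) (x)
#endif

/* Example of a macro that must be called with a byte value, not a pointer. */
#define MY_IS_ASCII_BYTE(c) ((MY_ASSERT_NOT_PTR(c) & 0x80) == 0)

static int first_byte_is_ascii(const unsigned char *s)
{
    /* MY_IS_ASCII_BYTE(s) would fail to compile, because s is a pointer. */
    return MY_IS_ASCII_BYTE(*s);
}
```

  Wrapping the idiom in a named macro is what makes each use self-documenting, and the #ifndef shows one way the check could be turned off for a problematic compiler.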
* Add utf8_to_utf16 | Karl Williamson | 2021-08-14 | 1 | -0/+4
* Improve utf16_to_utf8_reversed() | Karl Williamson | 2021-08-14 | 1 | -0/+5
  Instead of destroying the input by first swapping the bytes, this calls a base function with the order to use. The non-reverse function is changed to call the base function with the non-reversed order.
* Make macro isUTF8_CHAR_flags an inline fcn | Karl Williamson | 2021-08-14 | 1 | -39/+0
  This makes it use the fast DFA for this functionality.
* utf8.h: Comment changes | Karl Williamson | 2021-08-07 | 1 | -10/+21
* utf8.h: White space only | Karl Williamson | 2021-08-07 | 1 | -19/+19
* utf8.h: Refactor UTF8_IS_NONCHAR... | Karl Williamson | 2021-08-07 | 1 | -7/+6
  UTF8_IS_NONCHAR_GIVEN_THAT_NON_SUPER_AND_GE_PROBLEMATIC() is defined just for backward compatibility (though I don't think anyone uses it). Swap which macro is the base level that the other is defined in terms of.
* Refactor UTF8_IS_SUPER() | Karl Williamson | 2021-08-07 | 1 | -20/+14
  This uses macros recently introduced to remove an EBCDIC dependency and make the definition simpler. It now uses the DFA, which should speed up the non-edge case uses.
* utf8.h: Document some #defines | Karl Williamson | 2021-08-07 | 1 | -0/+37
  The reorganization in the previous commit revealed some undocumented public macros
* utf8.h: Move some #defines around | Karl Williamson | 2021-08-07 | 1 | -116/+119
  This moves the defines for things like surrogates, non-character code points, etc. to a more logical order, with like adjacent to like, and before they are otherwise used in the file.
* utf8.h: Remove EBCDIC dependency | Karl Williamson | 2021-08-07 | 1 | -7/+10
  By generalizing a macro, we can make it serve both ASCII and EBCDIC
* utf8.h: Add macros to calc UTF start byte, first cont | Karl Williamson | 2021-08-07 | 1 | -4/+41
  These two bytes are useful to know in some situations. This commit changes a couple such places to use the first macro.
* utf8.h: Reorder some preprocessor directives | Karl Williamson | 2021-08-07 | 1 | -14/+10
  This is just so that things are clearer to the reader
* utf8.h: Add #define | Karl Williamson | 2021-08-07 | 1 | -1/+5
  UTF_MIN_CONTINUATION_BYTE is clearer for use in some contexts
* utf8.h: Move all SKIP functions to be near each other | Karl Williamson | 2021-08-07 | 1 | -9/+8
  For convenient code reading
* utf8.h: Add new #define for extended length UTF-8 | Karl Williamson | 2021-08-07 | 1 | -0/+1
  The previous commit added a convenient place to create a symbol to indicate that the UTF-8 on this platform includes Perl's nearly-double length extension. The platforms this isn't needed on are 32-bit ASCII ones. This symbol allows removing one place where EBCDIC need be considered, and future commits will use it as well.
* utf8.h: Refactor MAX_UTF8_TWO_BYTE | Karl Williamson | 2021-08-07 | 1 | -3/+11
  The previous commit removed a macro that the comments for this refer to in explaining its derivation. So use an alternative that is actually clearer.
* Reimplement OFFUNISKIP | Karl Williamson | 2021-08-07 | 1 | -47/+23
  Now that previous commits have made it fast to find the position of the first set bit in a word, we can use a formula to find how many bytes the UTF-8 of that will occupy. This allows for simplification of this macro, removing several conditionals.
* utf8.h: Add macro to compute UV skip by its log2 | Karl Williamson | 2021-08-07 | 1 | -2/+30
  This macro will calculate at compile time, if passed a compile-time constant, how many UTF-8 bytes are required to represent the parameter. The macro is a helper which works fine except for edge cases, which a wrapper is needed to handle. The commit changes one instance to use this new macro
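  As a rough illustration of deriving the UTF-8 length from the position of the highest set bit, here is a minimal runtime sketch for ASCII platforms. It is not the macro from this commit: the function names are hypothetical, the real helper is a macro that can be evaluated at compile time, and EBCDIC (UTF-EBCDIC) uses different bit counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 0-based position of the highest set bit.
 * The argument must be non-zero. */
static unsigned my_msb_pos(uint32_t cp)
{
    unsigned pos = 0;
    while (cp >>= 1)
        pos++;
    return pos;
}

/* Helper formula, valid only for code points >= 0x80.
 * An n-byte sequence (n >= 2) carries 7-n payload bits in its start
 * byte plus 6 in each continuation byte, i.e. 5n+1 bits in total.
 * So a value needing 'bits' bits takes ceil((bits-1)/5) bytes,
 * which in integer arithmetic is (bits + 3) / 5. */
static unsigned my_skip_by_bits(unsigned bits)
{
    return (bits + 3) / 5;
}

/* Wrapper that handles the edge case the formula gets wrong:
 * single-byte (ASCII-range) code points, including 0. */
static unsigned my_utf8_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    return my_skip_by_bits(my_msb_pos(cp) + 1);
}
```

  The split into a formula plus a wrapper mirrors the commit's description of a helper that "works fine except for edge cases"; which edge cases the real wrapper handles is an assumption here.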
* utf8.h: Rmv EBCDIC dependency | Karl Williamson | 2021-08-07 | 1 | -11/+59
  This moves a #define into the common code for ASCII and EBCDIC machines. It adds a bunch of comments about the value that I wish I hadn't had to figure out for myself.
* Rename internal macro and move to utf8.h | Karl Williamson | 2021-08-07 | 1 | -0/+2
  This macro has a corresponding, older, name for the non-UTF-8 case. It makes sense to use the same paradigm, and move the definitions together so that the comments for one don't have to be repeated for the other.
* utf8.h: Remove an EBCDIC dependency | Karl Williamson | 2021-08-07 | 1 | -2/+19
  A symbol introduced in a previous commit allows this internal macro to only need a single version, suitable for either EBCDIC or ASCII.
* utf8.h: Add symbol for easing EBCDIC handling | Karl Williamson | 2021-08-07 | 1 | -0/+6
  This is then used in regcomp.c to avoid an #ifdef EBCDIC
* utf8.h: Make a bit of EBCDIC known to ASCII | Karl Williamson | 2021-08-07 | 1 | -4/+15
  This info is needed in one other place; doing it here means only specifying it once.
* utf8.h: Add a #define synonym | Karl Williamson | 2021-08-07 | 1 | -3/+9
  This is more clearly named for various uses in this file. It has an unwieldy length, but is unlikely to be used outside it.
* Refactor UTF_START_MASK() | Karl Williamson | 2021-08-07 | 1 | -5/+14
  A slight change to this very low level macro (hence called a lot) removes the need for a conditional, and causes it to work on single-byte UTF-8 characters on ASCII platforms. The definition is also moved to a more logical place in the file
* utf8.h: Move macro to earlier in file | Karl Williamson | 2021-08-07 | 1 | -13/+13
  This is now defined before first use
* UTF8_IS_DOWNGRADEABLE_START: Call less general helper | Karl Williamson | 2021-08-07 | 1 | -1/+1
  Future commits would otherwise make the expansion of this macro too complicated for some C compilers. Use a less general internal helper function to avoid that.
* Refactor UTF_START_MARK() | Karl Williamson | 2021-05-30 | 1 | -5/+10
  This allows the removal of a conditional in a very low level (called a lot) macro
* UTF8_IS_NEXT_CHAR_DOWNGRADEABLE() check before deref | Karl Williamson | 2021-05-29 | 1 | -2/+2
  Reorder the clauses to check first before dereferencing
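  The general idea of checking bounds before dereferencing can be sketched as follows. This is an assumption-laden illustration, not the macro's real definition: the function name is hypothetical, and the byte tests assume an ASCII platform, where a two-byte sequence starting with 0xC2 or 0xC3 encodes a code point below 256.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: returns non-zero if the character starting at s
 * (with e pointing just past the end of the buffer) is a two-byte UTF-8
 * sequence for a code point that fits in a single byte.
 * Because && short-circuits, the length test runs first and s[0] / s[1]
 * are only read when at least two bytes are available. */
static int my_next_char_fits_in_byte(const unsigned char *s,
                                     const unsigned char *e)
{
    return (e - s) > 1                  /* bounds check comes first      */
        && (s[0] & 0xFE) == 0xC2        /* start byte is 0xC2 or 0xC3    */
        && (s[1] & 0xC0) == 0x80;       /* second byte is a continuation */
}
```

  With the clauses in the other order, a call with s == e would read a byte past the end of the buffer before the length test could reject it.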
* utf8.h: Simplify UNICODE_IS_SURROGATE() | Karl Williamson | 2021-05-28 | 1 | -4/+3
  This uses inRANGE() with mnemonics to make it clearer with no increase in the number of conditionals
* utf8.h: Use inRANGE for UNICODE_IS_32_CONTIGUOUS_NONCHARS | Karl Williamson | 2021-05-28 | 1 | -2/+2
  This leads to a single conditional instead of two.
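  One common way a range test collapses into a single conditional is an unsigned subtraction, sketched below. Whether inRANGE() is implemented exactly this way is an assumption; the function and constant names here are hypothetical stand-ins, though the code point ranges themselves (surrogates U+D800..U+DFFF, contiguous non-characters U+FDD0..U+FDEF) come from Unicode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a range check; requires lo <= hi.
 * If cp < lo, the unsigned subtraction wraps around to a very large
 * value, so one comparison rejects values on both sides of the range. */
static int my_in_range(uint32_t cp, uint32_t lo, uint32_t hi)
{
    return (uint32_t)(cp - lo) <= (uint32_t)(hi - lo);
}

/* Mnemonic names instead of bare constants (names are illustrative). */
#define MY_FIRST_SURROGATE       0xD800u
#define MY_LAST_SURROGATE        0xDFFFu
#define MY_FIRST_CONTIG_NONCHAR  0xFDD0u
#define MY_LAST_CONTIG_NONCHAR   0xFDEFu

static int my_is_surrogate(uint32_t cp)
{
    return my_in_range(cp, MY_FIRST_SURROGATE, MY_LAST_SURROGATE);
}

static int my_is_contiguous_nonchar(uint32_t cp)
{
    return my_in_range(cp, MY_FIRST_CONTIG_NONCHAR, MY_LAST_CONTIG_NONCHAR);
}
```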
* utf8.h: Refactor UNICODE_IS_NONCHAR() | Karl Williamson | 2021-05-28 | 1 | -3/+3
  This adds branch prediction and re-orders so that an unlikely to succeed test is done before the likely to succeed one, so that the latter usually doesn't need to be executed. Since both conditions must succeed for the entire expression to succeed, this doesn't change what the whole expression matches.
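  The reordering can be illustrated with the non-characters that end each Unicode plane (U+FFFE, U+FFFF, U+1FFFE, and so on). The sketch below is an assumption about the shape of such a test, not the macro's actual definition; MY_LIKELY and MY_UNLIKELY are hypothetical wrappers around the GCC/Clang __builtin_expect builtin, with a plain fallback for other compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical branch-prediction hints. */
#if defined(__GNUC__) || defined(__clang__)
#  define MY_LIKELY(c)   __builtin_expect(!!(c), 1)
#  define MY_UNLIKELY(c) __builtin_expect(!!(c), 0)
#else
#  define MY_LIKELY(c)   (!!(c))
#  define MY_UNLIKELY(c) (!!(c))
#endif

/* True for the plane-ending non-characters up to U+10FFFF.
 * The low-16-bits test rarely succeeds, so putting it first means the
 * upper-bound test (which almost always succeeds) is usually skipped
 * by && short-circuiting. Both must hold, so the order does not change
 * the result. */
static int my_is_plane_end_nonchar(uint32_t cp)
{
    return MY_UNLIKELY((cp & 0xFFFE) == 0xFFFE)
        && MY_LIKELY(cp <= 0x10FFFF);
}
```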
* style: Detabify indentation of the C code maintained by the core. | Michael G. Schwern | 2021-01-17 | 1 | -2/+2
  This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue.
  Done with `expand -i`.
  * vutil.* left alone, it's part of version.
  * Left regen managed files alone for now.
* utf8.h: Fix syntax error only found on EBCDIC builds | Karl Williamson | 2020-12-04 | 1 | -1/+1
* autodoc.pl: Specify scn for single-purpose files | Karl Williamson | 2020-11-06 | 1 | -8/+0
  Many of the files in perl are for one thing only, and hence their embedded documentation will be for that one thing. By creating a hash here of them, those files don't have to worry about what section that documentation goes under, and so it can be completely changed without affecting them.
* Change some link pod for better rendering | Karl Williamson | 2020-08-31 | 1 | -7/+7
  C<L</foo>> renders better in places than L</C<foo>>
* Document ibcmp_utf8, and move to like-fcns hdr | Karl Williamson | 2020-08-22 | 1 | -3/+0
* utf8.h: Add comment | Karl Williamson | 2020-07-31 | 1 | -0/+1
* utf8.h: Remove obsolete macro | Karl Williamson | 2020-07-30 | 1 | -7/+0
  It turns out that this macro would have failed to compile since commit 538b546eb0f252250a30c08e6af47d0ea7433fa1, in October 2013. So it is clear no one is using it.
* Fix typo when using nBIT_UMAX | Nicolas R | 2020-07-22 | 1 | -1/+1
  nBIT_MAX was used instead of nBIT_UMAX from d223e1ea9ae864c0e563187f1e76 changes.
  Note: at first glance it seems that nBIT_UMAX is an alias for nBIT_MASK.
* utf8.h: Add some branch predictions | Karl Williamson | 2020-07-17 | 1 | -20/+26
  It is likely that the data will be well-formed Unicode, and not one of its special characters, like surrogates or non-characters, nor NUL.