path: root/unicode_constants.h
Commit message | Author | Age | Files | Lines
* fix incorrect vi filetype declarations in generated files | Lukas Mai | 2023-03-24 | 1 | -1/+1
  Vim's filetype declarations are case sensitive. The correct types for Perl, C, and Pod are perl, c, and pod, respectively.
* generated files - update mode lines to specify file type | Elvin Aslanov | 2023-02-19 | 1 | -2/+2
  This updates the mode line for most of our generated files to include file type information, so that they are properly syntax highlighted on GitHub. It makes no other functional changes to the files. [Note: Commit message rewritten by Yves]
* regcomp.c - decompose into smaller files | Yves Orton | 2022-12-09 | 1 | -4/+4
  This splits a bunch of the subcomponents of the regex engine into smaller files:

    regcomp_debug.c
    regcomp_internal.h
    regcomp_invlist.c
    regcomp_study.c
    regcomp_trie.c

  The only real change, besides the build changes needed to achieve the split, is the addition of some new defines that can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance, all of regcomp*.c define PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
* Support Unicode 15.0 | Unicode Consortium | 2022-09-28 | 1 | -2/+2
* Add arrows to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Unicode has lots of arrows of various shapes, sizes, and directions. None of them were of consequence to the Bidirectional algorithm, so none were specified as being mirrored pairs. This commit uses the generalizations already in place from previous commits to examine arrow symbols and choose which are mirrored pairs. As previously, it rejects arrows with contrary directionality, and ones without horizontal directionality.
* Add SPEAKERs paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The characters with this name look good as mirrored delimiters.
* Add TELEPHONE RECEIVER paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The characters with this name look good as mirrored delimiters.
* Add ERASE paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The characters with this name look good as mirrored delimiters.
* Add DOUBLE TRIANGLEs paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The characters with this name look good as mirrored delimiters.
* Add THREE RAYS paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The characters with this name look good as mirrored delimiters.
* Add musical score paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The characters that signify the beginning and ending of Western music scores serve as good delimiters.
* Add INDEX paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The bidi-aware characters containing this word are visually suitable for being mirrored delimiters. The 'index' refers to the index finger of a hand pointing at the delimited string.
* Add TURNSTILE paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The bidi-aware characters containing this word are visually suitable for being mirrored delimiters.
* Add TACK paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  The bidi-aware characters containing this word are visually suitable for being mirrored delimiters.
* unicode_constants.pl: Prepare for examining Symbols | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Heretofore, the code looking for paired string delimiters has looked at punctuation, and a few symbols that Unicode gives a mirror for. But there are many more suitable-for-pairing characters in Unicode. This commit generalizes things so as to handle the extra complexities of the way symbols are named beyond the punctuation names. For example, RIGHTWARDS is sometimes used; it turns out that it also is used in one punctuation character, which was previously overlooked by this script.

  The generalization introduced by this commit handles almost all current Unicode symbols properly. But some symbols are barely distinguishable from their mirrors, such as a tilde and a reversed tilde. The scheme adopted here, then, makes the default for a symbol pair to not be marked as paired delimiters. The code explicitly has to specify that a given pair is to be included. The next few commits are mostly for adding ones that I thought were good.
* Add 'ELEMENT OF'/CONTAINS to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  This commit adds 8 pairs of symbols that are variants on ELEMENT OF. These make nice paired delimiters in the vein of < >.
* Add SUBSET/SUPERSET to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  This commit adds 20 pairs of symbols that are variants on SUBSET. These make nice paired delimiters in the vein of < >.
* Add PRECEDES/SUCCEEDS to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  This commit adds 15 pairs of symbols that are variants on PRECEDES. These look a lot like <>, so it makes sense to make them paired delimiters.
* Add SMALLER THAN to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  This commit adds 2 pairs of symbols that are variants on SMALLER THAN. These look a lot like <>, so it makes sense to make them paired delimiters.
* unicode_constants.pl: Consider all \pP for delims | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Previously, only the punctuation characters that Unicode had classed as being opening/closing were considered in looking for suitable paired delimiters. This commit looks at all punctuation characters. There are actually only 7 new pairs found. This gives us ꧁ ꧂ as string delimiters, if your font allows, which are Javanese and used to surround an honorific title, according to Wikipedia.
* Add < > variants to paired delimiters | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Perl considers '< >' to be delimiters for strings; this commit adds most of the Unicode variants of these to also be string delimiters. The ones that are combinations of both < and > aren't included, as that would be visually confusing.
* unicode_constants.pl: Add REVERSED punctuation | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Besides LEFT/RIGHT, horizontal directionality can be specified by Unicode in names by the presence or absence of REVERSED. Enhancing the algorithm to take this into account adds 2 pairs of mirrored delimiters that were previously overlooked.
* unicode_constants.pl: Refactor to catch more paired delims | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Previously, only characters that Unicode included in its bidirectional algorithm have been eligible to be found by this program to be mirrored string delimiters. This commit adds 5 quotation mark character pairs that are omitted from the bidirectional algorithm, as most quotes are, because, as the Standard says, their "directionality and pairing status is less predictable than paired brackets." But we're not particularly interested in those semantics; most string delimiters will be selected only for their visual appearance.

  Because they aren't in the bidi algorithm, there is no property that maps one member of a pair to its mate. However, two characters whose names differ only by LEFT vs RIGHT are almost certainly a mirrored pair. This doesn't catch all possibilities; future commits will expand the ones caught.

  The commit refactors things so as to make easier the future commits that look at even more delimiter possibilities.
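The LEFT vs RIGHT name heuristic described in this commit message can be sketched in C as follows. This is only an illustration under stated assumptions: the real logic lives in the Perl script regen/unicode_constants.pl and is not reproduced here, and `mirrored_name` is a hypothetical helper that handles only the first LEFT or RIGHT word of a name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from regen/unicode_constants.pl.
 * Writes into `out` the name obtained by swapping the first whole word
 * LEFT or RIGHT found in `name` (words are separated by spaces or
 * hyphens). Returns 1 on success, 0 if no such word exists or the
 * result does not fit in `outsz` bytes. */
static int mirrored_name(const char *name, char *out, size_t outsz)
{
    static const char *const from[] = { "LEFT", "RIGHT" };
    static const char *const to[]   = { "RIGHT", "LEFT" };
    const char *p = name;

    while (*p) {
        /* Length of the current word. */
        size_t len = strcspn(p, " -");
        for (int i = 0; i < 2; i++) {
            if (len == strlen(from[i]) && strncmp(p, from[i], len) == 0) {
                /* Copy the prefix, the swapped word, then the rest. */
                int n = snprintf(out, outsz, "%.*s%s%s",
                                 (int)(p - name), name, to[i], p + len);
                return n >= 0 && (size_t)n < outsz;
            }
        }
        p += len;
        while (*p == ' ' || *p == '-')
            p++;                      /* skip separators */
    }
    return 0;
}
```

A caller would look up the candidate mate by the returned name and treat the two characters as a pair only if a character with that name exists. Names without a LEFT or RIGHT word, such as TILDE, yield 0 and need the other heuristics mentioned in the neighbouring commits.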
* Allow reversal of some paired delimiters; deprecations | Karl Williamson | 2022-03-19 | 1 | -18/+18
  Unicode says certain opening punctuation characters may be used as closing ones in some languages; and their mirror is instead the opening one. This commit changes to allow either one of each such set to be the opening one.

  It also deprecates using any of the new mirrored delimiters as an unmirrored delimiter outside the feature, and using the normal closing delimiter as an unpaired opening one while in the feature. This gives us the freedom to make some or all of the new paired delimiters be reversible.
* unicode_constants.pl: Generate paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+42
  This commit causes several C strings to be generated containing bytes that match paired string delimiters beyond the four that have traditionally been used in Perl. This will allow a future commit to accept more matching delimiters around strings than those four. The code explains how the added delimiters are chosen.
* regen/unicode_constants.pl: White space only | Karl Williamson | 2022-03-19 | 1 | -9/+9
  Align the output of this bit vertically with surrounding output.
* Support Unicode 14.0 | Unicode Consortium | 2021-09-15 | 1 | -2/+2
* Restrict scope/Shorten some very long macro names | Karl Williamson | 2020-11-22 | 1 | -5/+15
  The names were intended to force people to not use them outside their intended scopes. But by restricting those scopes in the first place, we don't need such unwieldy names.
* autodoc.pl: Enhance apidoc_section feature | Karl Williamson | 2020-11-06 | 1 | -1/+1
  This feature allows documentation destined for perlapi or perlintern to be split into sections of related functions, no matter where the documentation source is. Prior to this commit the line had to contain the exact text of the title of the section. Now it can be a $variable name that autodoc.pl expands to the title. It still has to be an exact match for the variable in autodoc, but now, the expanded text can be changed in autodoc alone, without other files needing to be updated at the same time.
* regen/unicode_constants.pl: Add a couple constants | Karl Williamson | 2020-10-16 | 1 | -0/+6
  These will be needed in a future commit.
* Change some =head1 to apidoc_section lines | Karl Williamson | 2020-09-04 | 1 | -1/+1
  apidoc_section is slightly favored over head1, as it is known only to autodoc, and can't be confused with real pod.
* Use Unicode 13.0 (beta) | Unicode Consortium | 2020-01-30 | 1 | -3/+3
  Unicode has changed its yearly release cycle so that the final version is not available until early March of the year. This year it is March 10, 2020. However, all changes planned were finalized in early January, and the actual computer files have been updated to their presumably final substantive versions. The release has been authorized without further review needed. The release is awaiting final documentation additions, and soak time of the beta to verify there are no glitches. This commit causes Perl to participate in that soak.

  I don't anticipate any problems, and likely the only substantive change upon the official release will be to update perldelta. Comments in the files supplied by Unicode will likely also change to indicate these are no longer beta.

  There were very few changes affecting existing characters; most of the changes involved adding new characters, including emoji. The break characteristics of some existing characters were changed (GCB, LB, WB, and SB properties). The only perl code I really had to change to cope with the new release was about rules in the Line Break property, dealing around ellipses (...) and certain East Asian characters next to opening parentheses.

  If there are problems, we can revert this at any time, and ship with 12.0.
* Fix apidoc macro entries | Karl Williamson | 2019-06-25 | 1 | -2/+2
  This makes various fixes to the text that is used to generate the documentation. The dominant change is to add the 'n' flag to indicate that the macro takes no arguments. A couple should have been marked with a D (for deprecated) flag, and a couple were missing parameters, and a couple were missing return values. These were spotted by using Devel::PPPort on them.
* Preliminary Unicode 12.1 | Unicode Consortium | 2019-04-08 | 1 | -2/+2
* Check for \n in EBCDIC code pages | Karl Williamson | 2019-03-06 | 1 | -2/+2
  IBM says that there are 13 characters whose code point varies depending on the EBCDIC code page. They fail to mention that the \n character may also vary. This commit adds checks for \n, in addition to the checks for the 13 graphic variant ones.
* Use Unicode 12.0 | Unicode Consortium | 2019-03-04 | 1 | -2/+2
  Unicode 12.0 is finalized. Change to use it.
* regen/unicode_constants.pl: generate UTF-8 for U+307 | Karl Williamson | 2019-02-04 | 1 | -3/+3
  This will be needed in a future commit.
* pp.c: Don't use function call for easy copy | Karl Williamson | 2019-02-04 | 1 | -3/+0
  Like the previous commit, this code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever support, so we can use the specialized code that is used most everywhere else for two-byte UTF-8 characters, avoiding the function overhead, and having to treat this character as particularly special.
* pp.c: Don't use function call for easy copy | Karl Williamson | 2019-02-04 | 1 | -3/+0
  This code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever support, so we can use the specialized code that is used most everywhere else for two-byte UTF-8 characters, avoiding the function overhead, and having to treat this character as particularly special.
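The idea of appending a two-byte UTF-8 character inline, rather than copying a pre-built byte string, can be sketched like this. It is a minimal illustration for ASCII platforms only: `append_two_byte_utf8` is a hypothetical name and not the macro pp.c actually uses, and the EBCDIC variant of the encoding is not modelled.

```c
#include <stdint.h>

/* Hypothetical helper, ASCII platforms only.
 * Appends the UTF-8 encoding of `cp`, which must lie in the two-byte
 * range U+0080..U+07FF, at `dst` and returns the pointer just past it. */
static unsigned char *append_two_byte_utf8(unsigned char *dst, uint32_t cp)
{
    *dst++ = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
    *dst++ = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
    return dst;
}
```

For example, GREEK CAPITAL LETTER SIGMA (U+03A3), chosen here arbitrarily since the commit message does not name the character, becomes the bytes 0xCE 0xA3.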
* Use Unicode 11.0 | Unicode Consortium | 2018-07-20 | 1 | -2/+2
  This completes the process of upgrading to Unicode 11.0.
* regen/unicode_constants.pl: Add U+10FFFF entry | Karl Williamson | 2017-11-18 | 1 | -0/+6
  We need the length of the UTF-8 for this code point elsewhere, and it is different between ASCII and EBCDIC.
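As a rough illustration of why a generated constant helps here, the following sketch computes the encoded length of a code point in standard UTF-8, as used on ASCII platforms. `utf8_len` is a hypothetical name, and the sketch deliberately does not model the EBCDIC encoding, which the commit message says gives a different length.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, standard UTF-8 (ASCII platforms) only.
 * Returns the number of bytes needed to encode `cp`, or 0 if `cp` is
 * above U+10FFFF. Surrogates are not treated specially. */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

Because the answer depends on the platform's encoding, generating it once into a header spares other code from hard-coding either value.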
* Use Unicode 10.0 | Karl Williamson | 2017-06-20 | 1 | -2/+2
  The new file from Unicode "extracted/DerivedName.txt" is not delivered here, as Perl doesn't need it, as it duplicates information in other files.
* Use new paradigm for hdr file double inclusion guard | Karl Williamson | 2017-06-02 | 1 | -3/+3
  We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define.

  See https://rt.perl.org/Ticket/Display.html?id=131110

  There are many more header files which are not guarded.
* Add C macros for UTF-8 for BOM and REPLACEMENT CHARACTER | Karl Williamson | 2016-08-31 | 1 | -0/+36
  This makes it easy for module authors to write XS code that can use these characters, and be automatically portable to EBCDIC systems.
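A minimal sketch of how such string macros might be used follows. The macro names below are hypothetical stand-ins rather than the ones the header defines, and the byte values are the standard UTF-8 encodings of U+FEFF and U+FFFD on ASCII platforms; on EBCDIC the generated header would supply different bytes, which is the point of using macros.

```c
#include <string.h>

/* Hypothetical names for illustration; ASCII-platform byte values. */
#define EXAMPLE_BOM_UTF8                   "\xEF\xBB\xBF"  /* U+FEFF */
#define EXAMPLE_REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"  /* U+FFFD */

/* Returns a pointer just past a leading BOM if `buf` starts with one,
 * otherwise returns `buf` unchanged. */
static const char *skip_bom(const char *buf, size_t len)
{
    const size_t bom_len = sizeof(EXAMPLE_BOM_UTF8) - 1;

    if (len >= bom_len && memcmp(buf, EXAMPLE_BOM_UTF8, bom_len) == 0)
        return buf + bom_len;
    return buf;
}
```

Code written against the macro never spells out the bytes itself, so it stays correct wherever the header is regenerated for a different character set.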
* Use Unicode 9.0 | Unicode Consortium | 2016-06-21 | 1 | -3/+3
  This includes regenerating the files that depend on the Unicode 9 data files.
* Don't generate EBCDIC POSIX-BC tables | Karl Williamson | 2016-01-14 | 1 | -39/+0
  This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tar ball smaller.

  See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Skip casing for high code points | Karl Williamson | 2015-12-09 | 1 | -0/+3
  As discussed in the previous commit, most code points in Unicode don't change if upper-, or lower-cased, etc. In fact as of Unicode v8.0, 93% of the available code points are above the highest one that does change.

  This commit skips trying to case these 93%. A regen/ script keeps track of the max changing one in the current Unicode release, and skips casing for the higher ones. Thus currently, casing emoji will be skipped.

  Together with the previous commits that dealt with casing, the potential for huge memory requirements for the swash hashes for casing is severely limited.

  If the following command is run on a perl compiled with -O2 and no DEBUGGING:

    blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

  and the file 'plane1_case_perf' contains

    [
        'string::casing::emoji' => {
            desc  => 'yes swash vs no swash',
            setup => 'my $a = "\x{1F570}"', # MANTELPIECE CLOCK
            code  => 'uc($a)'
        },
    ];

  the following results are obtained:

  The numbers represent raw counts per loop iteration.

  string::casing::emoji
  yes swash vs no swash

           before_this_commit    after
           ------------------ --------
    Ir                  981.0    306.0
    Dr                  228.0     94.0
    Dw                  100.0     45.0
    COND                137.0     49.0
    IND                   7.0      4.0
    COND_m                5.5      0.0
    IND_m                 4.0      2.0
    Ir_m1                 0.1     -0.1
    Dr_m1                 0.0      0.0
    Dw_m1                 0.0      0.0
    Ir_mm                 0.0      0.0
    Dr_mm                 0.0      0.0
    Dw_mm                 0.0      0.0
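The early exit this commit describes can be sketched as below. Everything here is an assumption made for illustration: the threshold value is a placeholder for whatever the regen/ script derives from the current Unicode data, `example_slow_uc` is a stand-in for the real table lookup and knows only ASCII letters, and none of these names come from the Perl source.

```c
#include <stdint.h>

/* Assumed placeholder: the highest code point that has any case mapping.
 * In the real tree a regen/ script derives this from the Unicode data. */
#define EXAMPLE_HIGHEST_CASE_CHANGING_CP 0x118DFu

/* Counts how often the expensive path is taken, to make the effect visible. */
static unsigned long example_lookup_calls = 0;

/* Stand-in for the expensive table lookup; only maps ASCII a-z. */
static uint32_t example_slow_uc(uint32_t cp)
{
    example_lookup_calls++;
    return (cp >= 'a' && cp <= 'z') ? cp - ('a' - 'A') : cp;
}

/* Uppercases `cp`, skipping the lookup for code points above the threshold. */
static uint32_t example_uc(uint32_t cp)
{
    if (cp > EXAMPLE_HIGHEST_CASE_CHANGING_CP)
        return cp;                    /* fast path: nothing up here changes case */
    return example_slow_uc(cp);
}
```

With this structure, uppercasing U+1F570 (the MANTELPIECE CLOCK from the benchmark) returns immediately without touching the lookup, which is the kind of saving the instruction counts above reflect.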
* Remove no longer used #define | Karl Williamson | 2015-09-04 | 1 | -4/+0
  The previous commit removed all uses of this non-public #define.
* regen/unicode_constants.pl: Add U+130, +131 | Karl Williamson | 2015-07-28 | 1 | -0/+8
  These will be used in the next commit.
* regen/unicode_constants.pl: Generate #defines giving which Unicode version | Karl Williamson | 2015-07-28 | 1 | -4/+9
  Future commits will want to take different actions depending on which Unicode version is being used.