path: root/regen/unicode_constants.pl
Commit message | Author | Age | Files | Lines
* Remove duplicate "the" in comments | Elvin Aslanov | 2023-05-03 | 1 | -1/+1
    Fix spelling on various files pertaining to core Perl.
* regcomp.c - decompose into smaller files | Yves Orton | 2022-12-09 | 1 | -2/+2
    This splits a bunch of the subcomponents of the regex engine into smaller files:

        regcomp_debug.c
        regcomp_internal.h
        regcomp_invlist.c
        regcomp_study.c
        regcomp_trie.c

    The only real change, besides the build changes needed to achieve the split, is to also add some new defines which can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance, all of regcomp*.c define PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
* Add arrows to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+9
    Unicode has lots of arrows of various shapes, sizes, and directions. None of them were of consequence to the Bidirectional algorithm, so none were specified as being mirrored pairs. This commit uses the generalizations already in place from previous commits to examine arrow symbols and choose which are mirrored pairs. As previously, it rejects arrows with contrary directionality, and ones without horizontal directionality.
* Add SPEAKERs paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+2
    The characters with this name look good as mirrored delimiters.
* Add TELEPHONE RECEIVER paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+2
    The characters with this name look good as mirrored delimiters.
* Add ERASE paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+1
    The characters with this name look good as mirrored delimiters.
* Add DOUBLE TRIANGLEs paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+2
    The characters with this name look good as mirrored delimiters.
* Add THREE RAYS paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+1
    The characters with this name look good as mirrored delimiters.
* Add musical score paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+9
    The characters that signify the beginning and ending of Western music scores serve as good delimiters.
* Add INDEX paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+3
    The bidi-aware characters containing this word are visually suitable for being mirrored delimiters. The 'index' refers to the index finger in a hand pointing at the delimited string.
* Add TURNSTILE paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+1
    The bidi-aware characters containing this word are visually suitable for being mirrored delimiters.
* Add TACK paired delimiters | Karl Williamson | 2022-03-19 | 1 | -1/+1
    The bidi-aware characters containing this word are visually suitable for being mirrored delimiters.
* Directionality pres/abs-ence can mean paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+11
    Another way Unicode indicates that a character has horizontal directionality is by adding LEFT or RIGHT to the name of a base character. Hence we get RIGHT SPEAKER vs just plain SPEAKER. Presumably this comes about when they didn't consider directionality at first, and then realized later it was needed.

    This commit makes the script look for these kinds of character pairs. Because the current Unicode version only has this characteristic for Symbols, and symbols must be included explicitly, no change in what gets paired ensues. But if you turn on the outputting of characters not chosen, that list will now include things meeting this new criterion. Fewer than a handful actually are like this.
* unicode_constants.pl: Prepare for examining Symbols | Karl Williamson | 2022-03-19 | 1 | -20/+206
    Heretofore, the code looking for paired string delimiters has looked at punctuation, and a few symbols that Unicode gives a mirror for. But there are many more suitable-for-pairing characters in Unicode. This commit generalizes things so as to handle the extra complexities of the way symbols are named beyond the punctuation names. For example, RIGHTWARDS is sometimes used; it turns out that it also is used in one punctuation character, which was previously overlooked by this script.

    The generalization introduced by this commit handles almost all current Unicode symbols properly. But some symbols are barely distinguishable from their mirrors, such as a tilde and a reversed tilde. The scheme adopted here, then, makes the default for a symbol pair to not be marked as paired delimiters. The code explicitly has to specify that a given pair is to be included. The next few commits are mostly for adding ones that I thought were good.
* Add 'ELEMENT OF'/CONTAINS to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+2
    This commit adds 8 pairs of symbols that are variants on ELEMENT OF. These make nice paired delimiters in the vein of < >.
* Add SUBSET/SUPERSET to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+2
    This commit adds 20 pairs of symbols that are variants on SUBSET. These make nice paired delimiters in the vein of < >.
* Add PRECEDES/SUCCEEDS to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+4
    This commit adds 15 pairs of symbols that are variants on PRECEDES. These look a lot like <>, so it makes sense to make them paired delimiters.
* Add SMALLER THAN to paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+7
    This commit adds 2 pairs of symbols that are variants on SMALLER THAN. These look a lot like <>, so it makes sense to make them paired delimiters.
* unicode_constants.pl: Consider all \pP for delims | Karl Williamson | 2022-03-19 | 1 | -1/+27
    Previously, only the punctuation characters that Unicode had classed as being opening/closing were considered in looking for suitable paired delimiters. This commit looks at all punctuation characters. There are actually only 7 new pairs found.

    This gives us ꧁ ꧂ as string delimiters, if your font allows, which are Javanese and used to surround an honorific title, according to Wikipedia.
* Add < > variants to paired delimiters | Karl Williamson | 2022-03-19 | 1 | -5/+74
    Perl considers '< >' to be delimiters for strings; this commit adds most of the Unicode variants of these to also be string delimiters. The ones that are combinations of both < and > aren't included, as that would be visually confusing.
* unicode_constants.pl: Add REVERSED punctuation | Karl Williamson | 2022-03-19 | 1 | -0/+2
    Besides LEFT/RIGHT, horizontal directionality can be specified by Unicode in names by the presence or absence of REVERSED. Enhancing the algorithm to take this into account adds 2 pairs of mirrored delimiters that were previously overlooked.
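
    The REVERSED rule can be sketched roughly as follows. Python is used only for illustration (the script itself is Perl), the helper name `unreversed_mate` is hypothetical, and the lookup uses Python's bundled Unicode database rather than the data the script reads. It does not try to reproduce which 2 pairs the commit actually found.

```python
import unicodedata

def unreversed_mate(ch):
    """Return the character named like ch but without REVERSED, or None.

    Hypothetical helper: it only compares names and applies none of the
    script's further suitability checks.
    """
    words = unicodedata.name(ch, "").split()
    if "REVERSED" not in words:
        return None
    words.remove("REVERSED")
    try:
        return unicodedata.lookup(" ".join(words))
    except KeyError:
        # No character carries the name with REVERSED dropped.
        return None
```

    For instance, `unreversed_mate("\u204f")` (REVERSED SEMICOLON) returns the plain `;`.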
* unicode_constants.pl: Output why chars not chosen | Karl Williamson | 2022-03-19 | 1 | -2/+64
    This script now examines all punctuation characters to see if there is a mirrored character for it, suitable for use as a Perl string delimiter. Some don't qualify, and some do qualify but the script doesn't catch them. This commit adds the ability to output which characters it doesn't think qualify, and why. This enables a maintainer to easily check and know what its deficiencies are, or that there is a good reason that a particular character gets rejected.
* unicode_constants.pl: Refactor to catch more paired delims | Karl Williamson | 2022-03-19 | 1 | -23/+124
    Previously, only characters that Unicode included in its bidirectional algorithm have been eligible to be found by this program to be mirrored string delimiters. This commit adds 5 quotation mark character pairs that are omitted from the bidirectional algorithm, as most quotes are, because, as the Standard says, their "directionality and pairing status is less predictable than paired brackets." But we're not particularly interested in those semantics; most string delimiters will be selected only for their visual appearance.

    Because they aren't in the bidi algorithm, there is no property that maps one member of a pair to its mate. However, two characters whose names pair only by LEFT vs RIGHT are almost certainly a mirrored pair. This doesn't catch all possibilities; future commits will expand the ones caught.

    The commit refactors things so as to make future commits easier which look at even more delimiter possibilities.
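
    A minimal sketch of the LEFT vs RIGHT name heuristic, written in Python purely for illustration (the script is Perl). The function name `right_mate` is hypothetical, and the sketch relies on Python's own Unicode database; the script's real criteria are more involved.

```python
import re
import unicodedata

def right_mate(ch):
    """Return the character whose name equals ch's with LEFT -> RIGHT, or None.

    Hypothetical helper illustrating name-based pairing only.
    """
    name = unicodedata.name(ch, "")
    if not re.search(r"\bLEFT\b", name):
        return None
    candidate = re.sub(r"\bLEFT\b", "RIGHT", name)
    try:
        return unicodedata.lookup(candidate)
    except KeyError:
        # The RIGHT counterpart does not exist under that name.
        return None
```

    With this, U+201C LEFT DOUBLE QUOTATION MARK maps to U+201D, while a character with no LEFT in its name yields `None`.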
* Allow reversal of some paired delimiters; deprecations | Karl Williamson | 2022-03-19 | 1 | -1/+12
    Unicode says certain opening punctuation characters may be used as closing ones in some languages; and their mirror is instead the opening one. This commit changes to allow either one of each such set to be the opening one.

    It also deprecates using any of the new mirrored delimiters outside the feature as an unmirrored delimiter, and using the normal closing delimiter as an unpaired opening one while in the feature. This gives us the freedom to make some or all of the new paired delimiters be reversible.
* regen/unicode_constants.pl: List paired delimiters | Karl Williamson | 2022-03-19 | 1 | -0/+101
    This adds the capability to temporarily change a scalar to true to cause this to print on stderr a list of the paired string delimiters, suitable for pasting into a pod.
* unicode_constants.pl: Generate paired string delimiters | Karl Williamson | 2022-03-19 | 1 | -4/+113
    This commit causes several C strings to be generated containing bytes that match paired string delimiters beyond the four that have traditionally been used in Perl. This will allow a future commit to accept more matching delimiters around strings than those four. The code explains how the added delimiters are chosen.
* regen/unicode_constants.pl: Extract code into a fcn | Karl Williamson | 2022-03-19 | 1 | -1/+21
    This is in preparation for it to be used in multiple places in a future commit.
* regen/unicode_constants.pl: White space only | Karl Williamson | 2022-03-19 | 1 | -3/+3
    Align the output of this bit vertically with surrounding output.
* Restrict scope/Shorten some very long macro names | Karl Williamson | 2020-11-22 | 1 | -8/+24
    The names were intended to force people to not use them outside their intended scopes. But by restricting those scopes in the first place, we don't need such unwieldy names.
* autodoc.pl: Enhance apidoc_section feature | Karl Williamson | 2020-11-06 | 1 | -1/+1
    This feature allows documentation destined for perlapi or perlintern to be split into sections of related functions, no matter where the documentation source is. Prior to this commit the line had to contain the exact text of the title of the section. Now it can be a $variable name that autodoc.pl expands to the title. It still has to be an exact match for the variable in autodoc, but now, the expanded text can be changed in autodoc alone, without other files needing to be updated at the same time.
* regen/unicode_constants.pl: Add a couple constants | Karl Williamson | 2020-10-16 | 1 | -0/+2
    which will be needed in a future commit
* Change some =head1 to apidoc_section lines | Karl Williamson | 2020-09-04 | 1 | -1/+1
    apidoc_section is slightly favored over head1, as it is known only to autodoc, and can't be confused with real pod.
* Fix apidoc macro entries | Karl Williamson | 2019-06-25 | 1 | -2/+2
    This makes various fixes to the text that is used to generate the documentation. The dominant change is to add the 'n' flag to indicate that the macro takes no arguments. A couple should have been marked with a D (for deprecated) flag, and a couple were missing parameters, and a couple were missing return values.

    These were spotted by using Devel::PPPort on them.
* regen/unicode_constants.pl: generate UTF-8 for U+307 | Karl Williamson | 2019-02-04 | 1 | -1/+1
    This will be needed in a future commit.
* pp.c: Don't use function call for easy copy | Karl Williamson | 2019-02-04 | 1 | -1/+0
    Like the previous commit, this code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever support, so we can use the specialized code that is used most everywhere else for two byte UTF-8 characters, avoiding the function overhead, and having to treat this character as particularly special.
* pp.c: Don't use function call for easy copy | Karl Williamson | 2019-02-04 | 1 | -1/+0
    This code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever support, so we can use the specialized code that is used most everywhere else for two byte UTF-8 characters, avoiding the function overhead, and having to treat this character as particularly special.
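
    As a rough illustration of why a two-byte character needs no general copy routine, the two bytes can be computed directly from the code point. This Python sketch covers only standard (ASCII-platform) UTF-8; the EBCDIC variant uses a different bit layout and is not shown. The function name is hypothetical, and since the commit message does not say which Greek character is involved, the usage example picks one arbitrarily.

```python
def two_byte_utf8(cp):
    """Encode a code point in U+0080..U+07FF as its two UTF-8 bytes."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point does not fit in two UTF-8 bytes")
    lead = 0xC0 | (cp >> 6)     # 110xxxxx carries the upper five bits
    trail = 0x80 | (cp & 0x3F)  # 10xxxxxx carries the lower six bits
    return bytes([lead, trail])
```

    `two_byte_utf8(0x3A3)` gives the same bytes as `"\u03a3".encode("utf-8")`.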
* regen/unicode_constants.pl: Add U+10FFFF entry | Karl Williamson | 2017-11-18 | 1 | -0/+2
    We need the length of the UTF-8 for this code point elsewhere, and it is different between ASCII and EBCDIC.
* Use new paradigm for hdr file double inclusion guard | Karl Williamson | 2017-06-02 | 1 | -3/+3
    We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define.

    See https://rt.perl.org/Ticket/Display.html?id=131110

    There are many more header files which are not guarded.
* Patch unit tests to explicitly insert "." into @INC when needed. | H.Merijn Brand | 2016-11-11 | 1 | -2/+2
    require calls now require ./ to be prepended to the file since . is no longer guaranteed to be in @INC.
* Add C macros for UTF-8 for BOM and REPLACEMENT CHARACTER | Karl Williamson | 2016-08-31 | 1 | -0/+31
    This makes it easy for module authors to write XS code that can use these characters, and be automatically portable to EBCDIC systems.
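
    For reference, the byte sequences involved on ASCII platforms can be derived as below. This is an illustrative Python sketch, not the generator itself: the helper `utf8_c_literal` is hypothetical, the output only imitates the general shape of a C string literal, and the EBCDIC byte values (which the macros exist to hide) are not computed here.

```python
BOM = "\uFEFF"                    # U+FEFF BYTE ORDER MARK
REPLACEMENT_CHARACTER = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER

def utf8_c_literal(ch):
    """Render the UTF-8 bytes of ch as C-style \\xHH escapes."""
    return "".join("\\x%02X" % byte for byte in ch.encode("utf-8"))

for label, ch in (("BOM", BOM), ("REPLACEMENT CHARACTER", REPLACEMENT_CHARACTER)):
    print(label, '"%s"' % utf8_c_literal(ch))
```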
* Skip casing for high code points | Karl Williamson | 2015-12-09 | 1 | -0/+16
    As discussed in the previous commit, most code points in Unicode don't change if upper-, or lower-cased, etc. In fact as of Unicode v8.0, 93% of the available code points are above the highest one that does change.

    This commit skips trying to case these 93%. A regen/ script keeps track of the max changing one in the current Unicode release, and skips casing for the higher ones. Thus currently, casing emoji will be skipped.

    Together with the previous commits that dealt with casing, the potential for huge memory requirements for the swash hashes for casing is severely limited.

    If the following command is run on a perl compiled with -O2 and no DEBUGGING:

        blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

    and the file 'plane1_case_perf' contains

        [
            'string::casing::emoji' => {
                desc    => 'yes swash vs no swash',
                setup   => 'my $a = "\x{1F570}"', # MANTELPIECE CLOCK
                code    => 'uc($a)'
            },
        ];

    the following results are obtained:

        The numbers represent raw counts per loop iteration.

        string::casing::emoji
        yes swash vs no swash

               before_this_commit    after
               ------------------ --------
            Ir              981.0    306.0
            Dr              228.0     94.0
            Dw              100.0     45.0
          COND              137.0     49.0
           IND                7.0      4.0
        COND_m                5.5      0.0
         IND_m                4.0      2.0
         Ir_m1                0.1     -0.1
         Dr_m1                0.0      0.0
         Dw_m1                0.0      0.0
         Ir_mm                0.0      0.0
         Dr_mm                0.0      0.0
         Dw_mm                0.0      0.0
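
    The idea can be sketched in Python, used here only as an illustration of the shortcut (the real change is in Perl's C code and a regen/ script). The names below are hypothetical, and the threshold comes from whichever Unicode version Python ships, so the share it reports will not necessarily match the 93% quoted for Unicode v8.0.

```python
import sys

def highest_casing_code_point():
    """Scan downward for the highest code point that any case mapping changes."""
    for cp in range(sys.maxunicode, -1, -1):
        ch = chr(cp)
        if ch.upper() != ch or ch.lower() != ch or ch.title() != ch:
            return cp
    return -1

# Computed once, much as a regen script would bake it into a constant.
HIGHEST_CASE_CHANGING = highest_casing_code_point()

def fast_upper(ch):
    """Uppercase one character, skipping the lookup for high code points."""
    if ord(ch) > HIGHEST_CASE_CHANGING:
        return ch  # nothing above the threshold ever changes case
    return ch.upper()

share = 1 - (HIGHEST_CASE_CHANGING + 1) / (sys.maxunicode + 1)
print("U+%04X; %.1f%% of code points are above it" % (HIGHEST_CASE_CHANGING, 100 * share))
```

    Under this sketch `fast_upper("\U0001F570")` returns its argument without consulting any case mapping, which is the effect the benchmark above measures for `uc`.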
* Remove no longer used #define | Karl Williamson | 2015-09-04 | 1 | -1/+0
    The previous commit removed all uses of this non-public #define.
* regen/unicode_constants.pl: Add U+130, +131 | Karl Williamson | 2015-07-28 | 1 | -0/+2
    These will be used in the next commit
* regen/unicode_constants.pl: Generate #defines giving which Unicode version | Karl Williamson | 2015-07-28 | 1 | -4/+17
    Future commits will want to take different actions depending on which Unicode version is being used.
* regen/unicode_constants.pl: Skip U+1E9E if not in Unicode version | Karl Williamson | 2015-07-28 | 1 | -1/+1
    LATIN CAPITAL LETTER SHARP S is not available in all Unicode releases; simply skip generating things when it isn't there.
* regen/unicode_constants.pl: Fix to work under skip_if_undef | Karl Williamson | 2015-07-28 | 1 | -2/+2
    This input flag was not being properly handled.
* regen/unicode_constants.pl: Move comments around | Karl Williamson | 2015-07-28 | 1 | -28/+32
    This moves the description of what <DATA> lines should look like to the __DATA__ line.
* Tighten uses of regex synthetic start class | Karl Williamson | 2014-09-29 | 1 | -0/+15
    A synthetic start class (SSC) is generated by the regular expression pattern compiler to give a consolidation of all the possible things that can match at the beginning of where a pattern can possibly match. For example

        qr/a?bfoo/;

    requires the match to begin with either an 'a' or a 'b'. There are no other possibilities. We can set things up to quickly scan for either of these in the target string, and only when one of these is found do we need to look for 'foo'.

    There is an overhead associated with using SSCs. If the number of possibilities that the SSC excludes is relatively small, it can be counter-productive to use them.

    This patch creates a crude sieve to decide whether to use an SSC or not. If the SSC doesn't exclude at least half the "likely" possibilities, it is discarded. This patch is a starting point, and can be refined if necessary as we gain experience.

    See thread beginning with http://nntp.perl.org/group/perl.perl5.porters/212644

    In many patterns, no SSC is generated; and with the advent of tries, SSC's have become less important, so whatever we do is not terribly critical.
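
    The sieve amounts to a simple threshold test, sketched here in Python for illustration only. The commit message does not define which characters count as "likely", so the `LIKELY` set below (Python's `string.printable`) is purely an assumption, as are the function and variable names; the actual logic lives in regcomp.c.

```python
import string

# Assumption: stand-in for the "likely" characters; the real set is not
# given in the commit message.
LIKELY = set(string.printable)

def ssc_is_worthwhile(start_chars, likely=LIKELY):
    """Keep the SSC only if it excludes at least half of the likely characters."""
    excluded = len(likely - set(start_chars))
    return 2 * excluded >= len(likely)
```

    For `qr/a?bfoo/` the start set is `{"a", "b"}`, which excludes nearly everything, so `ssc_is_worthwhile({"a", "b"})` is true; a start set covering all letters, digits and the space would be discarded under this assumed `LIKELY` set.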
* regen/unicode_constants.pl: Find max ascii print cp | Karl Williamson | 2014-08-25 | 1 | -0/+7
    This creates a #define that gives the highest code point that is an ASCII printable. On ASCII-ish platforms, this is 0x7E, but on EBCDIC platforms it varies, and can be as high as 0xFF. This is in preparation for needing this value in a future commit in regcomp.c
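
    One way to see why the value is platform-dependent is to map the ASCII printables through different encodings and take the largest resulting byte. This Python sketch uses the `cp037` codec as a stand-in for an EBCDIC code page; that choice and the function name are assumptions, and the code pages Perl actually supports may give other values.

```python
def max_printable_byte(codec):
    """Return the highest byte that any ASCII printable maps to under codec."""
    printables = [chr(cp) for cp in range(0x80) if chr(cp).isprintable()]
    # Each printable encodes to a single byte in both codecs used here.
    return max(ch.encode(codec)[0] for ch in printables)

print(hex(max_printable_byte("ascii")))  # ASCII platforms
print(hex(max_printable_byte("cp037")))  # one EBCDIC code page, as a stand-in
```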
* unicode_constants.h: Add definitions for ESC and VT | Karl Williamson | 2014-08-21 | 1 | -0/+2
    These will be used in future commits