path: root/unicode_constants.h
Commit message | Author | Age | Files | Lines
* Use Unicode 13.0 (beta) | Unicode Consortium | 2020-01-30 | 1 | -3/+3
    Unicode has changed its yearly release cycle so that the final version is not available until early March of the year. This year it is March 10, 2020. However, all changes planned were finalized in early January, and the actual computer files have been updated to their presumably final substantive versions. The release has been authorized without further review needed.

    The release is awaiting final documentation additions, and soak time of the beta to verify there are no glitches. This commit causes Perl to participate in that soak.

    I don't anticipate any problems, and likely the only substantive change upon the official release will be to update perldelta. Comments in the files supplied by Unicode will likely also change to indicate these are no longer beta.

    There were very few changes affecting existing characters; most of the changes involved adding new characters, including emoji. The break characteristics of some existing characters were changed (GCB, LB, WB, and SB properties). The only perl code I really had to change to cope with the new release was about rules in the Line Break property, dealing around ellipses (...) and certain East Asian characters next to opening parentheses.

    If there are problems, we can revert this at any time, and ship with 12.0.
* Fix apidoc macro entries | Karl Williamson | 2019-06-25 | 1 | -2/+2
    This makes various fixes to the text that is used to generate the documentation. The dominant change is to add the 'n' flag to indicate that the macro takes no arguments. A couple should have been marked with a D (for deprecated) flag, and a couple were missing parameters, and a couple were missing return values.

    These were spotted by using Devel::PPPort on them.
* Preliminary Unicode 12.1 | Unicode Consortium | 2019-04-08 | 1 | -2/+2
* Check for \n in EBCDIC code pages | Karl Williamson | 2019-03-06 | 1 | -2/+2
    IBM says that there are 13 characters whose code point varies depending on the EBCDIC code page. They fail to mention that the \n character may also vary. This commit adds checks for \n, in addition to the checks for the 13 graphic variant ones.
* Use Unicode 12.0 | Unicode Consortium | 2019-03-04 | 1 | -2/+2
    Unicode 12.0 is finalized. Change to use it.
* regen/unicode_constants.pl: generate UTF-8 for U+307 | Karl Williamson | 2019-02-04 | 1 | -3/+3
    This will be needed in a future commit.
* pp.c: Don't use function call for easy copy | Karl Williamson | 2019-02-04 | 1 | -3/+0
    Like the previous commit, this code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever support, so we can use the specialized code that is used most everywhere else for two-byte UTF-8 characters, avoiding the function overhead and having to treat this character as particularly special.
* pp.c: Don't use function call for easy copy | Karl Williamson | 2019-02-04 | 1 | -3/+0
    This code is adding the UTF-8 for a Greek character to a string. It previously used Copy, but this character is representable as two bytes in both ASCII and EBCDIC UTF-8, the only character sets that Perl will ever support, so we can use the specialized code that is used most everywhere else for two-byte UTF-8 characters, avoiding the function overhead and having to treat this character as particularly special.
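The "specialized code for two-byte UTF-8 characters" mentioned above can be pictured with a minimal sketch. It assumes an ASCII-based platform (UTF-EBCDIC lays out its bytes differently), and the function name `example_append_two_byte` is hypothetical, not taken from pp.c.

```c
#include <stdint.h>

/* Append a code point in the range U+0080..U+07FF to dest as two UTF-8
 * bytes and return a pointer just past them.  ASCII-based platforms only;
 * the caller must guarantee the range and two bytes of space. */
static unsigned char *example_append_two_byte(unsigned char *dest, uint32_t cp)
{
    *dest++ = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
    *dest++ = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
    return dest;
}
```

Because the body is two stores, it can be inlined wherever a two-byte character is appended, which is the saving over a general-purpose copy call that the commit describes.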
* Use Unicode 11.0 | Unicode Consortium | 2018-07-20 | 1 | -2/+2
    This completes the process of upgrading to Unicode 11.0.
* regen/unicode_constants.pl: Add U+10FFFF entry | Karl Williamson | 2017-11-18 | 1 | -0/+6
    We need the length of the UTF-8 for this code point elsewhere, and it is different between ASCII and EBCDIC.
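To make the length question concrete, here is a rough sketch of how the encoded length of a code point is determined under standard UTF-8 on an ASCII-based platform. The function name is hypothetical, and the EBCDIC figure is deliberately not computed here, since UTF-EBCDIC uses different thresholds.

```c
#include <stdint.h>

/* Number of bytes standard UTF-8 needs for cp (cp <= 0x10FFFF assumed).
 * ASCII-based platforms only. */
static int example_utf8_length(uint32_t cp)
{
    if (cp < 0x80)    return 1;  /* 0xxxxxxx */
    if (cp < 0x800)   return 2;  /* 110xxxxx 10xxxxxx */
    if (cp < 0x10000) return 3;  /* 1110xxxx 10xxxxxx 10xxxxxx */
    return 4;                    /* 11110xxx plus three continuation bytes */
}
```

Under these rules U+10FFFF takes four bytes; generating the constant per platform avoids hard-coding that number where the EBCDIC value would be wrong.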
* Use Unicode 10.0 | Karl Williamson | 2017-06-20 | 1 | -2/+2
    The new file from Unicode "extracted/DerivedName.txt" is not delivered here, as Perl doesn't need it, as it duplicates information in other files.
* Use new paradigm for hdr file double inclusion guard | Karl Williamson | 2017-06-02 | 1 | -3/+3
    We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define.

    See https://rt.perl.org/Ticket/Display.html?id=131110

    There are many more header files which are not guarded.
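As an illustration of the guard style described above, the following sketch shows a header guarded by a project-prefixed symbol with a trailing underscore. The file name, the `EXAMPLE_` prefix, and the constant are all hypothetical; the actual symbol used in unicode_constants.h is not shown in this log.

```c
/* example_constants.h (hypothetical header) */
#ifndef EXAMPLE_CONSTANTS_H_   /* prefixed, with a trailing underscore */
#define EXAMPLE_CONSTANTS_H_

/* Contents are seen only once, however many times the header is included. */
#define EXAMPLE_MAX_CODE_POINT 0x10FFFF

#endif /* EXAMPLE_CONSTANTS_H_ */
```

The trailing underscore keeps the guard visually distinct from ordinary feature macros, and the prefix lowers the chance of colliding with a symbol defined by unrelated code.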
* Add C macros for UTF-8 for BOM and REPLACEMENT CHARACTER | Karl Williamson | 2016-08-31 | 1 | -0/+36
    This makes it easy for module authors to write XS code that can use these characters, and be automatically portable to EBCDIC systems.
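A sketch of how such string macros might be used follows. The macro names carry an `EXAMPLE_` prefix to mark them as hypothetical stand-ins, and the byte values are the standard UTF-8 encodings for ASCII-based platforms only; a generated header would supply different bytes on EBCDIC.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for generated macros (ASCII-platform bytes). */
#define EXAMPLE_BOM_UTF8                   "\xEF\xBB\xBF"  /* U+FEFF */
#define EXAMPLE_REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"  /* U+FFFD */

/* Return a pointer past a leading BOM, or buf itself if there is none. */
static const char *example_skip_bom(const char *buf, size_t len)
{
    const size_t bom_len = sizeof(EXAMPLE_BOM_UTF8) - 1;  /* drop the NUL */

    if (len >= bom_len && memcmp(buf, EXAMPLE_BOM_UTF8, bom_len) == 0)
        return buf + bom_len;
    return buf;
}
```

Since the caller compares against the macro instead of literal bytes, the same source would compile unchanged on a platform whose header defines the macro differently.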
* Use Unicode 9.0 | Unicode Consortium | 2016-06-21 | 1 | -3/+3
    This includes regenerating the files that depend on the Unicode 9 data files.
* Don't generate EBCDIC POSIX-BC tables | Karl Williamson | 2016-01-14 | 1 | -39/+0
    This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tar ball smaller.

    See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Skip casing for high code points | Karl Williamson | 2015-12-09 | 1 | -0/+3
    As discussed in the previous commit, most code points in Unicode don't change if upper-, or lower-cased, etc. In fact as of Unicode v8.0, 93% of the available code points are above the highest one that does change.

    This commit skips trying to case these 93%. A regen/ script keeps track of the max changing one in the current Unicode release, and skips casing for the higher ones. Thus currently, casing emoji will be skipped.

    Together with the previous commits that dealt with casing, the potential for huge memory requirements for the swash hashes for casing is severely limited.

    If the following command is run on a perl compiled with -O2 and no DEBUGGING:

        blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

    and the file 'plane1_case_perf' contains

        [
            'string::casing::emoji' => {
                desc  => 'yes swash vs no swash',
                setup => 'my $a = "\x{1F570}"',  # MANTELPIECE CLOCK
                code  => 'uc($a)'
            },
        ];

    the following results are obtained:

    The numbers represent raw counts per loop iteration.

        string::casing::emoji
        yes swash vs no swash

               before_this_commit    after
               ------------------ --------
            Ir              981.0    306.0
            Dr              228.0     94.0
            Dw              100.0     45.0
          COND              137.0     49.0
           IND                7.0      4.0
        COND_m                5.5      0.0
         IND_m                4.0      2.0
         Ir_m1                0.1     -0.1
         Dr_m1                0.0      0.0
         Dw_m1                0.0      0.0
         Ir_mm                0.0      0.0
         Dr_mm                0.0      0.0
         Dw_mm                0.0      0.0
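The early exit this commit describes can be sketched as a simple threshold test in front of the expensive lookup. Everything named here is an assumption for illustration: the threshold value is a placeholder (the real one is derived from the Unicode data by the regen script), and the "table lookup" is a stand-in that only knows ASCII.

```c
#include <stdint.h>

/* Placeholder threshold, NOT the generated value. */
#define EXAMPLE_HIGHEST_CASE_CHANGING_CP 0x1E943

/* Stand-in for the expensive table-driven lookup; handles ASCII only. */
static uint32_t example_table_uppercase(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;
}

static uint32_t example_uppercase(uint32_t cp)
{
    if (cp > EXAMPLE_HIGHEST_CASE_CHANGING_CP)
        return cp;  /* nothing this high changes case: skip the lookup */
    return example_table_uppercase(cp);
}
```

With this shape, a code point such as U+1F570 returns immediately without touching any table, which is consistent with the lower counts in the "after" column above.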
* Remove no longer used #define | Karl Williamson | 2015-09-04 | 1 | -4/+0
    The previous commit removed all uses of this non-public #define.
* regen/unicode_constants.pl: Add U+130, +131 | Karl Williamson | 2015-07-28 | 1 | -0/+8
    These will be used in the next commit.
* regen/unicode_constants.pl: Generate #defines giving which Unicode version | Karl Williamson | 2015-07-28 | 1 | -4/+9
    Future commits will want to take different actions depending on which Unicode version is being used.
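One way such version #defines could be consumed is through preprocessor conditionals. The sketch below uses hypothetical `EXAMPLE_`-prefixed names and hard-codes version 8.0 purely for illustration; the names in the generated header may differ.

```c
/* Hypothetical stand-ins for generated version macros. */
#define EXAMPLE_UNICODE_MAJOR_VERSION 8
#define EXAMPLE_UNICODE_DOT_VERSION   0

/* True when the compiled-in Unicode version is at least major.dot. */
#define EXAMPLE_UNICODE_AT_LEAST(major, dot)             \
    (EXAMPLE_UNICODE_MAJOR_VERSION > (major)             \
     || (EXAMPLE_UNICODE_MAJOR_VERSION == (major)        \
         && EXAMPLE_UNICODE_DOT_VERSION >= (dot)))

/* Select behavior at compile time based on the version. */
#if EXAMPLE_UNICODE_AT_LEAST(6, 1)
#  define EXAMPLE_HAS_NEWER_RULES 1
#else
#  define EXAMPLE_HAS_NEWER_RULES 0
#endif
```

Keeping the comparison in one macro means call sites state the version they need and do not repeat the major/minor arithmetic.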
* Switch to Unicode Version 8 | The Unicode Consortium | 2015-06-17 | 1 | -1/+1
* Tighten uses of regex synthetic start class | Karl Williamson | 2014-09-29 | 1 | -0/+3
    A synthetic start class (SSC) is generated by the regular expression pattern compiler to give a consolidation of all the possible things that can match at the beginning of where a pattern can possibly match. For example

        qr/a?bfoo/;

    requires the match to begin with either an 'a' or a 'b'. There are no other possibilities. We can set things up to quickly scan for either of these in the target string, and only when one of these is found do we need to look for 'foo'.

    There is an overhead associated with using SSCs. If the number of possibilities that the SSC excludes is relatively small, it can be counter-productive to use them. This patch creates a crude sieve to decide whether to use an SSC or not. If the SSC doesn't exclude at least half the "likely" possibilities, it is discarded. This patch is a starting point, and can be refined if necessary as we gain experience.

    See thread beginning with http://nntp.perl.org/group/perl.perl5.porters/212644

    In many patterns, no SSC is generated; and with the advent of tries, SSC's have become less important, so whatever we do is not terribly critical.
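The "exclude at least half the likely possibilities" rule can be sketched as a count over a bitmap of permitted first bytes. This is only an illustration under stated assumptions: the 256-bit bitmap layout, the function name, and the choice of printable ASCII as the "likely" set are all hypothetical and not taken from regcomp.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* bitmap has one bit per byte value; a set bit means the pattern may
 * start with that byte.  Keep the start class only if it rules out at
 * least half of the "likely" bytes (assumed here: printable ASCII). */
static bool example_ssc_is_worthwhile(const uint8_t bitmap[32])
{
    int likely = 0;
    int excluded = 0;

    for (int c = 0x20; c <= 0x7E; c++) {
        likely++;
        if (!(bitmap[c >> 3] & (1u << (c & 7))))
            excluded++;
    }
    return excluded * 2 >= likely;
}
```

For a pattern like the one above, only 'a' and 'b' are set, so nearly every likely byte is excluded and the class is kept; a class that admits every byte excludes nothing and is discarded.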
* regen/unicode_constants.pl: Find max ascii print cp | Karl Williamson | 2014-08-25 | 1 | -0/+4
    This creates a #define that gives the highest code point that is an ASCII printable. On ASCII-ish platforms, this is 0x7E, but on EBCDIC platforms it varies, and can be as high as 0xFF. This is in preparation for needing this value in a future commit in regcomp.c
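For intuition only, the value in question can be found at run time by scanning the native byte range, as in the sketch below. It assumes the default "C" locale, and the function name is hypothetical; the commit itself generates a compile-time #define from a script and does not scan at run time.

```c
#include <ctype.h>

/* Highest native byte value that isprint() accepts in the current
 * locale, or -1 if none does.  Assumes the default "C" locale. */
static int example_max_ascii_printable(void)
{
    int max = -1;

    for (int c = 0; c <= 0xFF; c++) {
        if (isprint(c))
            max = c;
    }
    return max;
}
```

On an ASCII-based platform in the "C" locale this yields 0x7E ('~'), the ASCII figure quoted in the commit message.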
* unicode_constants.h: Add definitions for ESC and VT | Karl Williamson | 2014-08-21 | 1 | -0/+8
    These will be used in future commits.
* regen/unicode_constants.pl: Update to use EBCDIC utilities | Karl Williamson | 2014-05-31 | 1 | -21/+130
    This causes the generated unicode_constants.h to be valid on all supported platforms.
* Deprecate NBSP in \N{...} names | Karl Williamson | 2014-05-30 | 1 | -0/+3
    This is currently allowed, but is non-graphic, and is indistinguishable from a regular space. I was the one who initially allowed it, and did so out of ignorance of the negative consequences of doing so. There is no other precedent for including it.
* Remove no longer necessary constants | Karl Williamson | 2013-08-29 | 1 | -6/+0
    These character constants were used only for a special edge case in trie construction that has been removed -- except for one instance in regexec.c which could just as well be some other character.
* utf8.h, unicode_constants.h: Add some #defines. | Karl Williamson | 2013-08-29 | 1 | -0/+3
    These will be used in a future commit.
* unicode_constants.h: Add #defines for CR, LF | Karl Williamson | 2013-08-29 | 1 | -0/+2
* unicode_constants.h: Add #defines for Byte Order Mark | Karl Williamson | 2013-08-29 | 1 | -0/+2
    These will be used in future commits.
* unicode_constants.h: Add some #defines | Karl Williamson | 2013-05-20 | 1 | -0/+3
    These will be used in future commits.
* pp.c: Eliminate custom macro and use Copy() instead | Karl Williamson | 2013-05-20 | 1 | -0/+3
    I think it's clearer to use Copy. When I wrote this custom macro, we didn't have the infrastructure to generate a UTF-8 encoded string at compile time.
* regen/unicode_constants.pl: Change #define name | Karl Williamson | 2013-03-08 | 1 | -1/+1
    This was added in the 5.17 series so there's no code relying on its current name. I think that the abbreviation is clearer.
* regen/unicode_constants.pl: Make portable to non-ASCII | Karl Williamson | 2013-03-08 | 1 | -13/+14
    This now uses the U+ notation to indicate code points, which is unambiguous no matter what the platform's character set is. (charnames accepts the U+ notation.)
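The U+ notation names a code point by its hexadecimal Unicode value, independent of how the platform encodes characters. A minimal parser is sketched below to show what the notation carries; the function name and the -1 error convention are assumptions for illustration, and the regen script itself is written in Perl, not C.

```c
/* Parse "U+XXXX" (hex digits, either case) into a code point.
 * Returns -1 on malformed input or values above U+10FFFF. */
static long example_parse_u_plus(const char *s)
{
    long cp = 0;

    if (s[0] != 'U' || s[1] != '+' || s[2] == '\0')
        return -1;

    for (s += 2; *s != '\0'; s++) {
        int digit;

        if (*s >= '0' && *s <= '9')      digit = *s - '0';
        else if (*s >= 'A' && *s <= 'F') digit = *s - 'A' + 10;
        else if (*s >= 'a' && *s <= 'f') digit = *s - 'a' + 10;
        else return -1;

        cp = cp * 16 + digit;
        if (cp > 0x10FFFF)
            return -1;
    }
    return cp;
}
```

The result is a Unicode code point, not a native character code, which is what makes the notation unambiguous across character sets.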
* regen/unicode_constants.pl: Remove unused constant | Karl Williamson | 2013-03-08 | 1 | -1/+0
    This was added in the 5.17 series, so can't be yet in the field; and isn't needed.
* regcomp.c: Refactor join_exact() to handle all multi-char folds | Karl Williamson | 2012-10-09 | 1 | -4/+2
    join_exact() prior to this commit returned a delta for 3 problematic sequences showing that the minimum length they match is less than their nominal length. It turns out that this is needed for all multi-character fold sequences; our test suite just did not have the tests in it to show that. Tests that do show this will be added in a future commit, but code elsewhere must be fixed before they pass. regcomp.c
* regen/unicode_constants.pl: Add name parameter | Karl Williamson | 2012-09-13 | 1 | -0/+1
    A future commit will want to use the first surrogate code point's UTF-8 value. Add this to the generated macros, and give it a name, since there is no official one. The program has to be modified to cope with this.
* regexec.c: Use new macros instead of swashes | Karl Williamson | 2012-09-13 | 1 | -3/+0
    A previous commit has caused macros to be generated that will match Unicode code points of interest to the \X algorithm. This patch uses them. This speeds up modern Korean processing by 15%.

    Together with recent previous commits, the throughput of modern Korean under \X has more than doubled, and is now comparable to other languages (which have increased themselves by 35%).
* Rename regen'd hdr to reflect expanded capabilities | Karl Williamson | 2012-09-13 | 1 | -0/+48
    The recently added utf8_strings.h has been expanded to include more than just strings. I'm renaming it to avoid confusion.