| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
| |
The names were intended to discourage people from using them outside their
intended scopes. But by restricting those scopes in the first place, we
don't need such unwieldy names.
|
|
|
|
|
|
|
|
|
|
|
| |
This feature allows documentation destined for perlapi or perlintern to
be split into sections of related functions, no matter where the
documentation source is. Prior to this commit the line had to contain
the exact text of the title of the section. Now it can be a $variable
name that autodoc.pl expands to the title. It still has to be an exact
match for the variable in autodoc, but now, the expanded text can be
changed in autodoc alone, without other files needing to be updated at
the same time.
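As a sketch of the two forms (the section title and variable name below are invented for illustration; the real spellings live in autodoc.pl):

```
# Old form: the line had to contain the exact text of the section title
=head1 Numeric Formats

# New form: a $variable that autodoc.pl expands to the title
=for apidoc_section $numeric_formats
```

With the variable form, renaming the section requires touching only autodoc.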
|
|
|
|
| |
which will be needed in a future commit
|
|
|
|
|
| |
apidoc_section is slightly favored over head1, as it is known only to
autodoc, and can't be confused with real pod.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode has changed its yearly release cycle so that the final version
is not available until early March of the year. This year it is March
10, 2020.
However, all changes planned were finalized in early January, and the
actual computer files have been updated to their presumably final
substantive versions. The release has been authorized without further
review needed.
The release is awaiting final documentation additions, and soak time of
the beta to verify there are no glitches. This commit causes Perl to
participate in that soak.
I don't anticipate any problems, and likely the only substantive change
upon the official release will be to update perldelta. Comments in the
files supplied by Unicode will likely also change to indicate these are
no longer beta.
There were very few changes affecting existing characters; most of the
changes involved adding new characters, including emoji. The break
characteristics of some existing characters were changed (GCB, LB, WB,
and SB properties). The only perl code I really had to change to cope
with the new release concerned rules in the Line Break property dealing
with ellipses (...) and certain East Asian characters next to opening
parentheses.
If there are problems, we can revert this at any time, and ship with
12.0.
|
|
|
|
|
|
|
|
|
|
| |
This makes various fixes to the text that is used to generate the
documentation. The dominant change is to add the 'n' flag to indicate
that the macro takes no arguments. A few should have been marked with a
D (for deprecated) flag, and a few were missing parameters or return
values.
These were spotted by using Devel::PPPort on them.
|
| |
|
|
|
|
|
|
|
| |
IBM says that there are 13 characters whose code point varies depending
on the EBCDIC code page. They fail to mention that the \n character may
also vary. This commit adds checks for \n, in addition to the checks
for the 13 graphic variant ones.
|
|
|
|
| |
Unicode 12.0 is finalized. Change to use it.
|
|
|
|
| |
This will be needed in a future commit
|
|
|
|
|
|
|
|
|
|
| |
Like the previous commit, this code is adding the UTF-8 for a Greek
character to a string. It previously used Copy, but this character is
representable as two bytes in both ASCII and EBCDIC UTF-8, the only
character sets that Perl will ever support, so we can use the
specialized code that is used almost everywhere else for two-byte UTF-8
characters, avoiding the function overhead and having to treat this
character as particularly special.
|
|
|
|
|
|
|
|
|
| |
This code is adding the UTF-8 for a Greek character to a string. It
previously used Copy, but this character is representable as two bytes
in both ASCII and EBCDIC UTF-8, the only character sets that Perl will
ever support, so we can use the specialized code that is used almost
everywhere else for two-byte UTF-8 characters, avoiding the function
overhead and having to treat this character as particularly special.
|
|
|
|
| |
This completes the process of upgrading to Unicode 11.0.
|
|
|
|
|
| |
We need the length of the UTF-8 for this code point elsewhere, and it
is different between ASCII and EBCDIC.
|
|
|
|
|
|
| |
The new file from Unicode, "extracted/DerivedName.txt", is not delivered
here; Perl doesn't need it, as it duplicates information in other
files.
|
|
|
|
|
|
|
|
|
|
| |
We changed to symbols that are unlikely to conflict with non-Perl code,
and which have trailing underbars, so they don't look like regular
Perl #defines.
See https://rt.perl.org/Ticket/Display.html?id=131110
There are many more header files which are not guarded.
|
|
|
|
|
| |
This makes it easy for module authors to write XS code that can use
these characters, and be automatically portable to EBCDIC systems.
|
|
|
|
|
| |
This includes regenerating the files that depend on the Unicode 9 data
files.
|
|
|
|
|
|
|
|
|
|
|
| |
This commit comments out the code that generates these tables. This is
trivially reversible. We don't believe anyone is using Perl and
POSIX-BC at this time, and skipping them saves time during development
when the tables would otherwise have to be regenerated, and makes the
resulting tarball smaller.
See thread beginning at
http://nntp.perl.org/group/perl.perl5.porters/233663
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As discussed in the previous commit, most code points in Unicode
don't change if upper- or lower-cased, etc. In fact, as of Unicode
v8.0, 93% of the available code points are above the highest one that
does change.
This commit skips trying to case these 93%. A regen/ script keeps track
of the max changing one in the current Unicode release, and skips casing
for the higher ones. Thus currently, casing emoji will be skipped.
Together with the previous commits that dealt with casing, the potential
for huge memory requirements for the swash hashes for casing is
severely limited.
If the following command is run on a perl compiled with -O2 and no
DEBUGGING:
blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after
and the file 'plane1_case_perf' contains
[
    'string::casing::emoji' => {
        desc  => 'yes swash vs no swash',
        setup => 'my $a = "\x{1F570}"', # MANTELPIECE CLOCK
        code  => 'uc($a)'
    },
];
the following results are obtained:
The numbers represent raw counts per loop iteration.
string::casing::emoji
yes swash vs no swash
before_this_commit after
------------------ --------
Ir 981.0 306.0
Dr 228.0 94.0
Dw 100.0 45.0
COND 137.0 49.0
IND 7.0 4.0
COND_m 5.5 0.0
IND_m 4.0 2.0
Ir_m1 0.1 -0.1
Dr_m1 0.0 0.0
Dw_m1 0.0 0.0
Ir_mm 0.0 0.0
Dr_mm 0.0 0.0
Dw_mm 0.0 0.0
|
|
|
|
| |
The previous commit removed all uses of this non-public #define.
|
|
|
|
| |
These will be used in the next commit
|
|
|
|
|
| |
Future commits will want to take different actions depending on which
Unicode version is being used.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A synthetic start class (SSC) is generated by the regular expression
pattern compiler to give a consolidation of all the possible things that
can match at the beginning of where a pattern can possibly match.
For example
qr/a?bfoo/;
requires the match to begin with either an 'a' or a 'b'. There are no
other possibilities. We can set things up to quickly scan for either of
these in the target string, and only when one of these is found do we
need to look for 'foo'.
There is an overhead associated with using SSCs. If the number of
possibilities that the SSC excludes is relatively small, it can be
counter-productive to use them.
This patch creates a crude sieve to decide whether to use an SSC or not.
If the SSC doesn't exclude at least half the "likely" possibilities, it
is discarded. This patch is a starting point, and can be refined if
necessary as we gain experience.
See thread beginning with
http://nntp.perl.org/group/perl.perl5.porters/212644
In many patterns, no SSC is generated; and with the advent of tries,
SSCs have become less important, so whatever we do is not terribly
critical.
|
|
|
|
|
|
|
| |
This creates a #define that gives the highest code point that is an
ASCII printable. On ASCII-ish platforms, this is 0x7E, but on EBCDIC
platforms it varies, and can be as high as 0xFF. This is in preparation
for needing this value in a future commit in regcomp.c
|
|
|
|
| |
These will be used in future commits
|
|
|
|
|
| |
This causes the generated unicode_constants.h to be valid on all
supported platforms.
|
|
|
|
|
|
|
| |
This is currently allowed, but is non-graphic, and is indistinguishable
from a regular space. I was the one who initially allowed it, and did
so out of ignorance of the negative consequences of doing so. There is
no other precedent for including it.
|
|
|
|
|
|
| |
These character constants were used only for a special edge case in trie
construction that has been removed -- except for one instance in
regexec.c which could just as well be some other character.
|
|
|
|
| |
These will be used in a future commit
|
| |
|
|
|
|
| |
These will be used in future commits
|
|
|
|
| |
These will be used in future commits
|
|
|
|
|
|
| |
I think it's clearer to use Copy. When I wrote this custom macro, we
didn't have the infrastructure to generate a UTF-8 encoded string at
compile time.
|
|
|
|
|
| |
This was added in the 5.17 series so there's no code relying on its
current name. I think that the abbreviation is clearer.
|
|
|
|
|
|
| |
This now uses the U+ notation to indicate code points, which is
unambiguous no matter what the platform's character set is. (charnames
accepts the U+ notation.)
|
|
|
|
|
| |
This was added in the 5.17 series, so can't yet be in the field; and
isn't needed.
|
|
|
|
|
|
|
|
|
|
| |
join_exact() prior to this commit returned a delta for 3 problematic
sequences showing that the minimum length they match is less than their
nominal length. It turns out that this is needed for all
multi-character fold sequences; our test suite just did not have the
tests in it to show that. Tests that do show this will be added in a
future commit, but code elsewhere must be fixed before they pass.
regcomp.c
|
|
|
|
|
|
|
| |
A future commit will want to use the first surrogate code point's UTF-8
value. Add this to the generated macros, and give it a name, since
there is no official one. The program has to be modified to cope with
this.
|
|
|
|
|
|
|
|
|
|
| |
A previous commit has caused macros to be generated that will match
Unicode code points of interest to the \X algorithm. This patch uses
them. This speeds up modern Korean processing by 15%.
Together with recent previous commits, the throughput of modern Korean
under \X has more than doubled, and is now comparable to other
languages (which have themselves increased by 35%).
|
|
The recently added utf8_strings.h has been expanded to include more than
just strings. I'm renaming it to avoid confusion.
|