| Commit message | Author | Age | Files | Lines |
| |
Vim's filetype declarations are case sensitive. The correct types for
Perl, C, and Pod are perl, c, and pod, respectively.
| |
This updates the mode-line for most of our generated files to include
file type information, so that they are properly syntax highlighted
on GitHub.
This does not make any other functional changes to the files.
[Note: Commit message rewritten by Yves]
| |
This splits a bunch of the subcomponents of the regex engine into
smaller files.
regcomp_debug.c
regcomp_internal.h
regcomp_invlist.c
regcomp_study.c
regcomp_trie.c
The only real change, besides the build machinery needed to achieve the
split, is the addition of some new defines which can be used in embed.fnc
to control exports without having to enumerate /every/ regex engine file.
For instance, all of regcomp*.c define PERL_IN_REGCOMP_ANY, and this is
used in embed.fnc to manage exports.
| |
Unicode has lots of arrows of various shapes, sizes, and directions.
None of them were of consequence to the Bidirectional algorithm, so none
were specified as being mirrored pairs. This commit uses the
generalizations already in place from previous commits to examine arrow
symbols and choose which are mirrored pairs.
As previously, it rejects arrows with contrary directionality, and ones
without horizontal directionality.
| |
The characters with this name look good as mirrored delimiters.
| |
The characters with this name look good as mirrored delimiters.
| |
The characters with this name look good as mirrored delimiters.
| |
The characters with this name look good as mirrored delimiters.
| |
The characters with this name look good as mirrored delimiters.
| |
The characters that signify the beginning and ending of Western music
scores serve as good delimiters.
| |
The bidi-aware characters containing this word are visually suitable for
being mirrored delimiters. The 'index' refers to the index finger
in a hand pointing at the delimited string.
| |
The bidi-aware characters containing this word are visually suitable for
being mirrored delimiters.
| |
The bidi-aware characters containing this word are visually suitable for
being mirrored delimiters.
| |
Heretofore, the code looking for paired string delimiters has looked at
punctuation, and a few symbols that Unicode gives a mirror for. But
there are many more suitable-for-pairing characters in Unicode.
This commit generalizes things so as to handle the extra complexities of
the way symbols are named beyond the punctuation names. For example,
RIGHTWARDS is sometimes used; it turns out that it also is used in one
punctuation character, which was previously overlooked by this script.
The generalization introduced by this commit handles almost all current
Unicode symbols properly.
But some symbols are barely distinguishable from their mirrors, such as
a tilde and a reversed tilde. The scheme adopted here, then, makes the
default for a symbol pair to not be marked as paired delimiters. The
code explicitly has to specify that a given pair is to be included.
The next few commits are mostly for adding ones that I thought were
good.
| |
This commit adds 8 pairs of symbols that are variants on ELEMENT OF.
These make nice paired delimiters in the vein of < >.
| |
This commit adds 20 pairs of symbols that are variants on SUBSET.
These make nice paired delimiters in the vein of < >.
| |
This commit adds 15 pairs of symbols that are variants on PRECEDES.
These look a lot like <>, so it makes sense to make them paired delimiters.
| |
This commit adds 2 pairs of symbols that are variants on SMALLER THAN.
These look a lot like <>, so it makes sense to make them paired delimiters.
| |
Previously, only the punctuation characters that Unicode had classed as
being opening/closing were considered in looking for suitable paired
delimiters.
This commit looks at all punctuation characters. There are actually
only 7 new pairs found.
This gives us ꧁ ꧂ as string delimiters (if your font can display them);
these are Javanese and are used to surround an honorific title, according
to Wikipedia.
| |
Perl considers '< >' to be delimiters for strings; this commit adds
most of the Unicode variants of these to also be string delimiters. The
ones that are combinations of both < and > aren't included, as that
would be visually confusing.
| |
Besides LEFT/RIGHT, horizontal directionality can be specified by
Unicode in names by the presence or absence of REVERSED.
Enhancing the algorithm to take this into account adds 2 pairs of
mirrored delimiters that were previously overlooked.
| |
Previously, only characters that Unicode included in its bidirectional
algorithm have been eligible to be found by this program to be mirrored
string delimiters.
This commit adds 5 quotation mark character pairs that
are omitted from the bidirectional algorithm, as most quotes are,
because, as the Standard says, their "directionality and pairing status
is less predictable than paired brackets."
But we're not particularly interested in those semantics; most string
delimiters will be selected only for their visual appearance.
Because they aren't in the bidi algorithm, there is no property that
maps one member of a pair to its mate. However, two characters whose
names pair only by LEFT vs RIGHT are almost certainly a mirrored pair.
This doesn't catch all possibilities; future commits will expand the
ones caught.
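The LEFT vs RIGHT name matching described above can be sketched in C as
follows. This is only an illustration: the function name `mirrored_name`
is hypothetical, the actual pairing is done by a script rather than by C
code like this, and the sketch only handles LEFT and RIGHT when they appear
as whole space-separated words.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `name` into `out`, swapping the first
 * standalone word LEFT for RIGHT (or RIGHT for LEFT).
 * Returns 1 if a swap was made and the result fit in `out`, else 0. */
static int mirrored_name(const char *name, char *out, size_t outsz)
{
    const char *p = name;

    while (*p) {
        size_t len = strcspn(p, " ");   /* length of the current word */
        const char *swap = NULL;

        if (len == 4 && strncmp(p, "LEFT", 4) == 0)
            swap = "RIGHT";
        else if (len == 5 && strncmp(p, "RIGHT", 5) == 0)
            swap = "LEFT";

        if (swap) {
            /* prefix before the word + swapped word + rest of the name */
            int n = snprintf(out, outsz, "%.*s%s%s",
                             (int)(p - name), name, swap, p + len);
            return n >= 0 && (size_t)n < outsz;
        }

        p += len;
        if (*p == ' ')
            p++;                        /* step over the separator */
    }
    return 0;                           /* no direction word found */
}
```

A candidate pair would then be accepted only if the name produced for one
character is the actual name of another character; that lookup is left
out of the sketch.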
This commit also refactors things to make it easier for future commits
to look at even more delimiter possibilities.
| |
Unicode says certain opening punctuation characters may be used as
closing ones in some languages; and their mirror is instead the opening
one.
This commit changes things to allow either one of each such set to be the
opening one.
It also deprecates using any of the new mirrored delimiters as an
unmirrored delimiter outside the feature, and using the normal closing
delimiter as an unpaired opening one while in the feature. This gives
us the freedom to make some or all of the new paired delimiters
reversible.
| |
This commit causes several C strings to be generated containing bytes
that match paired string delimiters beyond the four that have
traditionally been used in Perl. This will allow a future commit to
accept more matching delimiters around strings than those four.
The code explains how the added delimiters are chosen.
| |
Align the output of this bit vertically with surrounding output.
| |
The names were intended to force people to not use them outside their
intended scopes. But by restricting those scopes in the first place, we
don't need such unwieldy names.
| |
This feature allows documentation destined for perlapi or perlintern to
be split into sections of related functions, no matter where the
documentation source is. Prior to this commit the line had to contain
the exact text of the title of the section. Now it can be a $variable
name that autodoc.pl expands to the title. It still has to be an exact
match for the variable in autodoc, but now, the expanded text can be
changed in autodoc alone, without other files needing to be updated at
the same time.
| |
which will be needed in a future commit
| |
apidoc_section is slightly favored over head1, as it is known only to
autodoc, and can't be confused with real pod.
| |
Unicode has changed its yearly release cycle so that the final version
is not available until early March of the year. This year it is March
10, 2020.
However, all changes planned were finalized in early January, and the
actual computer files have been updated to their presumably final
substantive versions. The release has been authorized without further
review needed.
The release is awaiting final documentation additions, and soak time of
the beta to verify there are no glitches. This commit causes Perl to
participate in that soak.
I don't anticipate any problems, and likely the only substantive change
upon the official release will be to update perldelta. Comments in the
files supplied by Unicode will likely also change to indicate these are
no longer beta.
There were very few changes affecting existing characters; most of the
changes involved adding new characters, including emoji. The break
characteristics of some existing characters were changed (GCB, LB, WB,
and SB properties). The only perl code I really had to change to cope
with the new release concerned rules in the Line Break property that
deal with ellipses (...) and certain East Asian characters next to
opening parentheses.
If there are problems, we can revert this at any time, and ship with
12.0.
| |
This makes various fixes to the text that is used to generate the
documentation. The dominant change is to add the 'n' flag to indicate
that the macro takes no arguments. A couple should have been marked
with a D (for deprecated) flag, and a couple were missing parameters,
and a couple were missing return values.
These were spotted by using Devel::PPPort on them.
| |
IBM says that there are 13 characters whose code point varies depending
on the EBCDIC code page. They fail to mention that the \n character may
also vary. This commit adds checks for \n, in addition to the checks
for the 13 graphic variant ones.
| |
Unicode 12.0 is finalized. Change to use it.
| |
This will be needed in a future commit
| |
Like the previous commit, this code is adding the UTF-8 for a Greek
character to a string. It previously used Copy, but this character is
representable as two bytes in both ASCII and EBCDIC UTF-8, the only
character sets that Perl will ever support, so we can use the
specialized code that is used most everywhere else for two byte UTF-8
characters, avoiding the function overhead, and having to treat this
character as particularly special.
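For readers unfamiliar with the encoding, here is a minimal sketch of what
appending a two-byte UTF-8 character involves on an ASCII platform. The
helper name `append_two_byte_utf8` and the sample code point in the usage
note are assumptions for illustration; this is not Perl's actual macro, and
the EBCDIC variant (UTF-EBCDIC) uses a different bit layout that is not
shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write the two-byte UTF-8 form of `cp` at `dest`
 * and return a pointer just past the bytes written.
 * Only valid for code points in the two-byte range 0x80..0x7FF. */
static uint8_t *append_two_byte_utf8(uint8_t *dest, uint32_t cp)
{
    assert(cp >= 0x80 && cp <= 0x7FF);

    *dest++ = (uint8_t)(0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
    *dest++ = (uint8_t)(0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
    return dest;
}
```

For example, calling it with U+03BC (GREEK SMALL LETTER MU, chosen
arbitrarily) stores the bytes 0xCE 0xBC. Because the length is known to be
two, no general-purpose copy or length computation is needed.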
| |
This code is adding the UTF-8 for a Greek character to a string. It
previously used Copy, but this character is representable as two bytes
in both ASCII and EBCDIC UTF-8, the only character sets that Perl will
ever support, so we can use the specialized code that is used most
everywhere else for two byte UTF-8 characters, avoiding the function
overhead, and having to treat this character as particularly special.
| |
This completes the process of upgrading to Unicode 11.0.
| |
We need the length of the UTF-8 for this code point elsewhere, and it
is different between ASCII and EBCDIC.
| |
The new file from Unicode "extracted/DerivedName.txt" is not delivered
here, as Perl doesn't need it, as it duplicates information in other
files.
| |
We changed to use symbols not likely to be used by non-Perl code that
could conflict, and which have trailing underbars, so they don't look
like a regular Perl #define.
See https://rt.perl.org/Ticket/Display.html?id=131110
There are many more header files which are not guarded.
| |
This makes it easy for module authors to write XS code that can use
these characters, and be automatically portable to EBCDIC systems.
| |
This includes regenerating the files that depend on the Unicode 9 data
files.
| |
This commit comments out the code that generates these tables. This is
trivially reversible. We don't believe anyone is using Perl and
POSIX-BC at this time, and this saves time during development when
having to regenerate these tables, and makes the resulting tar ball
smaller.
See thread beginning at
http://nntp.perl.org/group/perl.perl5.porters/233663
| |
As discussed in the previous commit, most code points in Unicode
don't change if upper-, or lower-cased, etc. In fact as of Unicode
v8.0, 93% of the available code points are above the highest one that
does change.
This commit skips trying to case these 93%. A regen/ script keeps track
of the max changing one in the current Unicode release, and skips casing
for the higher ones. Thus currently, casing emoji will be skipped.
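The cutoff idea can be sketched in C as below. Everything here is
illustrative: the constant's value is an assumed placeholder (the real one
is generated from the Unicode data by the regen/ script), and
`demo_table_upper` is a toy stand-in for the expensive table-driven lookup,
not Perl's actual code.

```c
#include <stdint.h>

/* Assumed placeholder value, for illustration only. */
#define ASSUMED_HIGHEST_CASE_CHANGING_CP 0x118DFu

/* Counts how often the (pretend) expensive path is taken. */
static unsigned long slow_lookups = 0;

/* Toy stand-in for the table-driven lookup: only knows ASCII a-z. */
static uint32_t demo_table_upper(uint32_t cp)
{
    slow_lookups++;
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;
}

/* Uppercase a code point, skipping the lookup entirely for anything
 * above the highest code point known to change case. */
static uint32_t upper_with_cutoff(uint32_t cp)
{
    if (cp > ASSUMED_HIGHEST_CASE_CHANGING_CP)
        return cp;                  /* fast path: cannot change */
    return demo_table_upper(cp);
}
```

With this shape, a code point such as U+1F570 is returned unchanged after a
single comparison and never touches the lookup tables.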
Together with the previous commits that dealt with casing, the potential
for huge memory requirements for the swash hashes for casing is
severely limited.
If the following command is run on a perl compiled with -O2 and no
DEBUGGING:

    blead Porting/bench.pl --raw --perlargs="-Ilib -X" --benchfile=plane1_case_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

and the file 'plane1_case_perf' contains

    [
        'string::casing::emoji' => {
            desc  => 'yes swash vs no swash',
            setup => 'my $a = "\x{1F570}"',  # MANTELPIECE CLOCK
            code  => 'uc($a)'
        },
    ];

the following results are obtained:

    The numbers represent raw counts per loop iteration.

    string::casing::emoji
    yes swash vs no swash

            before_this_commit    after
            ------------------ --------
    Ir                   981.0    306.0
    Dr                   228.0     94.0
    Dw                   100.0     45.0
    COND                 137.0     49.0
    IND                    7.0      4.0
    COND_m                 5.5      0.0
    IND_m                  4.0      2.0
    Ir_m1                  0.1     -0.1
    Dr_m1                  0.0      0.0
    Dw_m1                  0.0      0.0
    Ir_mm                  0.0      0.0
    Dr_mm                  0.0      0.0
    Dw_mm                  0.0      0.0
| |
The previous commit removed all uses of this non-public #define.
| |
These will be used in the next commit
| |
Future commits will want to take different actions depending on which
Unicode version is being used.