| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Previously regcharclass.pl could tell if an input string was a
multi-character fold of some Unicode code point. This commit adds the
ability to return what that code point is. This capability will be used
in a later commit.
| |
Prior to this commit, only the upper case of Latin1 characters was dealt
with. But we really want case folding, and there are a few other
characters that fold to Latin1. This commit acknowledges them.
| |
Outdent and remove lines from changes in the previous commit.
| |
Prior to this commit, the generated macros for dealing with multi-char
folds in UTF-8 strings only recognized completely folded strings. This
commit changes that to add the uppercase for characters in the Latin1
range. Hopefully an example will clarify.
The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i'
followed by U+0307: COMBINING DOT ABOVE. But since we are doing /i
matching, an 'I' followed by U+0307 should also match. This commit
changes the macros to know this. Before this, if the fold were entirely
ASCII, the macros would know all the possible combinations. This commit
extends that to all code points < 256. (Since there are no folds to the
upper Latin1 range, that really means all code points below 128. But
making it general means it won't have to be revised if a fold is
ever added to the upper half of the range.)
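
As a rough illustration of the U+0130 example above, the check could be sketched in C as follows. This is not the generated macro itself; the function name `match_u0130_fold` and its return convention are assumptions made only for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, not one of the generated macros: returns the number
 * of bytes matched if the UTF-8 string s (len bytes long) begins with the
 * fold of U+0130, and 0 otherwise. */
size_t match_u0130_fold(const unsigned char *s, size_t len)
{
    if (len < 3)
        return 0;

    /* Accept the folded 'i' and, because matching is under /i, 'I' too. */
    if (s[0] != 'i' && s[0] != 'I')
        return 0;

    /* U+0307 COMBINING DOT ABOVE is the two bytes 0xCC 0x87 in UTF-8. */
    if (s[1] != 0xCC || s[2] != 0x87)
        return 0;

    return 3;
}
```

Before a change of this kind, a check like this would accept only the fully folded form, that is, only the 'i' branch.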
The reason to make this change is that it makes some future code less
complicated. And it adds very little complexity to the generated
macros; less than the code it will save. I originally thought it would
be more complex than it now turns out to be. Much of that is because
the infrastructure has advanced since that decision.
I couldn't find any current places that this change will allow to be
simplified. There could be if the macros were extended to do this on
all code points, not just the low ones. I tried that, but the generated
macros were at least 400 lines longer than before. That does add
significant complexity, so I backed that out.
| |
These macros will be used in a future commit, and are for
three-character folds. regen/regcharclass*.pl are changed for this
purpose.
| |
This creates a simply named array instead of a more complicated array
ref, so it is easier to understand.
| |
It makes the result more legible if it uses the printable character
instead of an escape sequence when appropriate.
Currently the value is re-escaped for output anyway, but this helped
during debugging.
| |
Committer: For porting tests: Update $VERSION in 4 files.
Run:
    ./perl -Ilib regen/mk_invlists.pl
    ./perl -Ilib regen/regcharclass.pl
| |
Several places require special handling because of this, notably for the
lowercase Sharp S, but not in Unicode versions before 3.0.1.
| |
This bit of code is not about just ASCII folds, so skip it when doing just
those.
| |
These messages say the output number is Unicode, but it is really
native, so change them to say just 0xXXXX.
| |
    use locale;
    fc("\N{LATIN CAPITAL LETTER SHARP S}")
                eq fc("\N{LATIN SMALL LETTER LONG S}") x 2
should return true, as the SHARP S folds to two 's's in a row, and the
LONG S is an antique variant of 's', and folds to s. Until this commit,
the expression was false.
Similarly, the following should match, but didn't until this commit:
    "\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa
The reason these didn't work properly is that in both cases the actual
fold to 's' is disallowed. In the first case because of locale; and in
the second because of /aa. And the code wasn't smart enough to realize
that these were legal.
The fix is to special case these so that the fold of sharp s (both
capital and small) is two LONG S's under /aa; as is the fold of the
capital sharp s under locale. The latter is user-visible, and the
documentation of fc() now points that out. I believe this is such an
edge case that no mention of it need be done in perldelta.
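
The /aa half of the special case described above might be modelled in C roughly like this. It is a simplified sketch, not the actual implementation: the function `fold_sharp_s`, its `ascii_restricted` flag and the macro names are all hypothetical, and the locale case is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define SMALL_SHARP_S   0x00DF  /* LATIN SMALL LETTER SHARP S */
#define CAPITAL_SHARP_S 0x1E9E  /* LATIN CAPITAL LETTER SHARP S */
#define LONG_S          0x017F  /* LATIN SMALL LETTER LONG S */

/* Hypothetical helper: writes the fold of a sharp s into out[] and returns
 * the number of code points written, or 0 if cp is not a sharp s.
 * ascii_restricted models /aa, where a non-ASCII character may not fold
 * to ASCII 's'. */
size_t fold_sharp_s(uint32_t cp, int ascii_restricted, uint32_t out[2])
{
    if (cp != SMALL_SHARP_S && cp != CAPITAL_SHARP_S)
        return 0;

    /* Normally the fold is "ss"; under the restriction it is two LONG S's,
     * which stay outside the ASCII range. */
    out[0] = out[1] = ascii_restricted ? (uint32_t)LONG_S : (uint32_t)'s';
    return 2;
}
```

With the restricted fold, a sharp s and two LONG S's compare equal after folding, because under /aa the LONG S is not allowed to fold to 's' and so folds to itself.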
| |
This changes the macro isMULTI_CHAR_FOLD() (non-utf8 version) from just
generating ascii-range code points to generating the full Latin1 range.
However there are no such non-ASCII values, so the macro expansion is
unchanged. By changing the name, it becomes clearer in future commits
that we aren't excluding things that we should be considering.
|
This takes as input the current Unicode character database, and outputs
lists of the multi-character folds in it, in a form suitable for input
to regen/regcharclass.pl.
|